1.1.1 (a) Read in the data and assess the stationarity of the raw stock indices.
We are going to use two methods. The first is to plot the stock indices over time to get a visual sense of whether they appear stationary.
Code
# Create interactive plots for each index
plot_sp500 <- plot_ly(x = sp500.df$DATE, y = sp500.df$SP500, type = "scatter", mode = "lines", name = "SP500") %>%
  layout(title = "SP500")
plot_cac40 <- plot_ly(x = cac40.df$DATE, y = cac40.df$CAC40, type = "scatter", mode = "lines", name = "CAC40") %>%
  layout(title = "CAC40")
plot_nasdaq <- plot_ly(x = nasdaq.df$DATE, y = nasdaq.df$NASDAQ, type = "scatter", mode = "lines", name = "Nasdaq") %>%
  layout(title = "Nasdaq")
plot_nikkei <- plot_ly(x = nikkei.df$DATE, y = nikkei.df$NIKKEI, type = "scatter", mode = "lines", name = "NIKKEI") %>%
  layout(title = "NIKKEI")

combined_plot <- subplot(plot_sp500, plot_cac40, plot_nasdaq, plot_nikkei, nrows = 2)

# Set a common title for the entire grid
combined_plot <- layout(combined_plot, title = "Comparison of Stock Indices")

# Show the combined plot
combined_plot
Based on the plots, none of the stock indices appear stationary: each exhibits clear trends, a strong indicator of non-stationary data. The NASDAQ index, for instance, shows a pronounced upward trend between 1999 and the beginning of 2000.
Another common approach is the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the time series is non-stationary. If the p-value is below the chosen significance level (here 0.05), we reject the null hypothesis and treat the series as stationary.
Code
# ADF - Test
adf_test_sp500 <- adf.test(sp500$SP500) ; adf_test_sp500
## 
##  Augmented Dickey-Fuller Test
## 
## data:  sp500$SP500
## Dickey-Fuller = -1.212849, Lag order = 15, p-value = 0.904429
## alternative hypothesis: stationary
adf_test_nasdaq <- adf.test(nasdaq$NASDAQ) ; adf_test_nasdaq
## 
##  Augmented Dickey-Fuller Test
## 
## data:  nasdaq$NASDAQ
## Dickey-Fuller = -1.570559, Lag order = 13, p-value = 0.760138
## alternative hypothesis: stationary
adf_test_cac40 <- adf.test(cac40$CAC40) ; adf_test_cac40
## 
##  Augmented Dickey-Fuller Test
## 
## data:  cac40$CAC40
## Dickey-Fuller = -0.7332045, Lag order = 13, p-value = 0.96759
## alternative hypothesis: stationary
adf_test_nikkei <- adf.test(nikkei$NIKKEI) ; adf_test_nikkei
## 
##  Augmented Dickey-Fuller Test
## 
## data:  nikkei$NIKKEI
## Dickey-Fuller = -2.419329, Lag order = 13, p-value = 0.400792
## alternative hypothesis: stationary
We cannot reject any of the null hypotheses (all p-values are well above 0.05), so we have no evidence that the raw series are stationary, consistent with the plots.
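As a sanity check of how the ADF decision rule behaves (illustrative simulation, not part of the assignment data; assumes the tseries package already loaded for adf.test()), the test should fail to reject non-stationarity for a simulated random walk but reject it for white noise:

```r
library(tseries)

set.seed(42)
white_noise <- rnorm(1000)            # stationary by construction
random_walk <- cumsum(rnorm(1000))    # non-stationary by construction

# adf.test() truncates very small p-values at 0.01 (with a warning)
adf_wn <- suppressWarnings(adf.test(white_noise))
adf_rw <- adf.test(random_walk)

adf_wn$p.value   # small: reject non-stationarity
adf_rw$p.value   # large: cannot reject non-stationarity
```

This mirrors what we see above: the raw index levels behave like the random walk, not like the white noise.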
1.1.2 (b) Create a function to transform the daily stock indices into their daily negative log returns counterparts. Plot the latter series and assess their stationarity. To compare the series, also plot the negative log returns on a common scale to all indices.
Code
# Define a function to calculate negative log returns
calculate_log_returns <- function(stocks) {
  log_returns <- -diff(log(stocks))
  return(log_returns)
}

# Calculate negative log returns for each index
log_returns_sp500 <- calculate_log_returns(sp500) %>% na.omit()
log_returns_cac40 <- calculate_log_returns(cac40) %>% na.omit()
log_returns_nasdaq <- calculate_log_returns(nasdaq) %>% na.omit()
log_returns_nikkei <- calculate_log_returns(nikkei) %>% na.omit()

# Turn the row names (dates) into a separate variable so the plots can use them
log_returns_sp500$Dates <- rownames(log_returns_sp500)
log_returns_sp500.df <- data.frame(log_returns_sp500)
log_returns_cac40$Dates <- rownames(log_returns_cac40)
log_returns_cac40.df <- data.frame(log_returns_cac40)
log_returns_nasdaq$Dates <- rownames(log_returns_nasdaq)
log_returns_nasdaq.df <- data.frame(log_returns_nasdaq)
log_returns_nikkei$Dates <- rownames(log_returns_nikkei)
log_returns_nikkei.df <- data.frame(log_returns_nikkei)

# Plot for SP500
plot_sp5002 <- plot_ly(x = log_returns_sp500.df$Dates, y = log_returns_sp500,
                       type = "scatter", mode = "lines", name = "SP500") %>%
  layout(title = "SP500 Negative Log Returns",
         xaxis = list(title = "Time"),
         yaxis = list(title = "Negative Log Returns"),
         shapes = list(list(type = "line", x0 = 1, x1 = length(log_returns_sp500),
                            y0 = 0, y1 = 0, line = list(color = "red"))))

# Plot for CAC40
plot_cac402 <- plot_ly(x = log_returns_cac40.df$Dates, y = log_returns_cac40,
                       type = "scatter", mode = "lines", name = "CAC40") %>%
  layout(title = "CAC40 Negative Log Returns",
         xaxis = list(title = "Time"),
         yaxis = list(title = "Negative Log Returns"),
         shapes = list(list(type = "line", x0 = 1, x1 = length(log_returns_cac40),
                            y0 = 0, y1 = 0, line = list(color = "red"))))

# Plot for Nasdaq
plot_nasdaq2 <- plot_ly(x = log_returns_nasdaq.df$Dates, y = log_returns_nasdaq,
                        type = "scatter", mode = "lines", name = "Nasdaq") %>%
  layout(title = "Nasdaq Negative Log Returns",
         xaxis = list(title = "Time"),
         yaxis = list(title = "Negative Log Returns"),
         shapes = list(list(type = "line", x0 = 1, x1 = length(log_returns_nasdaq),
                            y0 = 0, y1 = 0, line = list(color = "red"))))

# Plot for NIKKEI
plot_nikkei2 <- plot_ly(x = log_returns_nikkei.df$Dates, y = log_returns_nikkei,
                        type = "scatter", mode = "lines", name = "NIKKEI") %>%
  layout(title = "NIKKEI Negative Log Returns",
         xaxis = list(title = "Time"),
         yaxis = list(title = "Negative Log Returns"),
         shapes = list(list(type = "line", x0 = 1, x1 = length(log_returns_nikkei),
                            y0 = 0, y1 = 0, line = list(color = "red"))))

# Create a subplot with the list of individual plots
combined_plot2 <- subplot(plot_sp5002, plot_cac402, plot_nasdaq2, plot_nikkei2, nrows = 2)
combined_plot2 <- layout(combined_plot2, title = "Comparison of negative log returns of Stock Indices (same scale)")

# Show the combined plot
combined_plot2
Looking at the comparison of the negative log returns of the stock indices, we discern the following:
The mean is not constant: the series are not fully centred around 0, implying some long-term drift in the indices.
The variance is not constant: volatility arrives in clusters, with calm stretches interrupted by bursts of large moves. The S&P 500 shows particularly high spikes, such as the one on 1997-10-27.
Graphically, therefore, the negative log returns of the indices still appear non-stationary.
1.1.3 (c) Draw histograms of the negative log returns and compare them to the Normal distribution. What do you observe?
Code
# Set the number of histogram bins
num_bins <- 30

# Create a common scale for the histograms
par(mfrow = c(2, 2))  # Create a 2x2 grid of plots

# Plot histograms and overlay Normal distribution curves
hist(log_returns_sp500, breaks = num_bins, main = "Histogram - SP500 Log Returns",
     xlab = "Log Returns", probability = TRUE, xlim = c(-0.1, 0.1))
curve(dnorm(x, mean = mean(log_returns_sp500), sd = sd(log_returns_sp500)),
      col = "red", lwd = 2, add = TRUE)
hist(log_returns_cac40, breaks = num_bins, main = "Histogram - CAC40 Log Returns",
     xlab = "Log Returns", probability = TRUE, xlim = c(-0.1, 0.1))
curve(dnorm(x, mean = mean(log_returns_cac40), sd = sd(log_returns_cac40)),
      col = "red", lwd = 2, add = TRUE)
hist(log_returns_nasdaq, breaks = num_bins, main = "Histogram - Nasdaq Log Returns",
     xlab = "Log Returns", probability = TRUE, xlim = c(-0.1, 0.1))
curve(dnorm(x, mean = mean(log_returns_nasdaq), sd = sd(log_returns_nasdaq)),
      col = "red", lwd = 2, add = TRUE)
hist(log_returns_nikkei, breaks = num_bins, main = "Histogram - NIKKEI Log Returns",
     xlab = "Log Returns", probability = TRUE, xlim = c(-0.1, 0.1))
curve(dnorm(x, mean = mean(log_returns_nikkei), sd = sd(log_returns_nikkei)),
      col = "red", lwd = 2, add = TRUE)
Code
# Reset the plot layout
par(mfrow = c(1, 1))
In a perfectly Normal sample, the histogram bars would closely follow the red curve. Compared to the overlaid Normal density, the histograms of the negative log returns show clear departures from normality: sharper peaks, heavier tails and some skewness. Such deviations hold significant implications for the Black-Scholes formula, which assumes normally distributed returns and constant volatility. If returns exhibit heavy tails, as the histograms suggest, the Black-Scholes model may understate the probability of extreme price movements, thereby potentially mispricing options that rely on these assumptions.
To validate these observations, further analytical methods will be employed later in this document to rigorously assess the distributional characteristics of the stock indices’ returns.
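As a first numeric check of those departures, sample skewness and excess kurtosis can be computed directly in base R (a sketch; under normality both are approximately 0, and heavy tails show up as positive excess kurtosis):

```r
# Standardize, then take third and fourth moments of the z-scores
sample_skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}
sample_excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3   # subtract 3 so the Normal reference value is 0
}

# Applied to the series above, e.g.:
# sample_skewness(as.numeric(log_returns_sp500))
# sample_excess_kurtosis(as.numeric(log_returns_sp500))
```

Large positive excess kurtosis on the return series would corroborate the heavy tails read off the histograms.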
1.1.4 (d) Check the normality assumption of the negative log returns using QQ-plots. What is your conclusion?
Code
# Create QQ-plots for log returns
par(mfrow = c(2, 2))  # Create a 2x2 grid of plots
qqnorm(log_returns_sp500, main = "QQ-Plot - SP500 Log Returns")
qqline(log_returns_sp500)
qqnorm(log_returns_cac40, main = "QQ-Plot - CAC40 Log Returns")
qqline(log_returns_cac40)
qqnorm(log_returns_nasdaq, main = "QQ-Plot - Nasdaq Log Returns")
qqline(log_returns_nasdaq)
qqnorm(log_returns_nikkei, main = "QQ-Plot - NIKKEI Log Returns")
qqline(log_returns_nikkei)
The QQ-plots for the negative log returns across various stock indices exhibit noticeable deviations from the expected diagonal line, particularly within the tails, which indicates a departure from normal distribution, with a tendency towards more pronounced extreme movements. The Nasdaq index, for instance, distinctly shows this trend with an increased frequency of significant negative returns. The differing scales of the plots also reveal a variable degree of volatility among the indices, with the Nasdaq displaying particularly stark fluctuations. This evidence challenges the assumption of normally distributed returns integral to the Black-Scholes model, suggesting potential inaccuracies in option pricing that relies on this assumption.
In conclusion, the observed anomalies in the QQ-plots signal that the normality assumption critical to the Black-Scholes framework is violated.
1.1.5 (e) Formally test the normality assumption of the negative log returns using an Anderson-Darling testing procedure. Do you reject the Normal hypothesis?
Code
# Perform Anderson-Darling test for normality
ad_test_sp500 <- ad.test(log_returns_sp500)
ad_test_cac40 <- ad.test(log_returns_cac40)
ad_test_nasdaq <- ad.test(log_returns_nasdaq)
ad_test_nikkei <- ad.test(log_returns_nikkei)

# Print the results
print(ad_test_sp500)
## 
##  Anderson-Darling normality test
## 
## data:  log_returns_sp500
## A = 29.23992, p-value < 2.22e-16
print(ad_test_cac40)
## 
##  Anderson-Darling normality test
## 
## data:  log_returns_cac40
## A = 10.3297, p-value < 2.22e-16
print(ad_test_nasdaq)
## 
##  Anderson-Darling normality test
## 
## data:  log_returns_nasdaq
## A = 29.42786, p-value < 2.22e-16
print(ad_test_nikkei)
## 
##  Anderson-Darling normality test
## 
## data:  log_returns_nikkei
## A = 8.189858, p-value < 2.22e-16
Conducting the Anderson-Darling test for normality on the negative log returns of the stock indices, we obtain p-values that effectively round down to zero (p-value < 2.2e-16), providing strong evidence to reject the null hypothesis of normality.
In summary, log returns of the indices are not normally distributed, violating one of the key assumptions behind the Black-Scholes formula.
1.1.6 (f) Use the fitdistr() function from the MASS package in order to obtain the (maximum-likelihood estimated) parameters of distributions you could imagine for the negative log returns. Try to fit at least two different distributions on the data and, using an information criteria (such as the AIC), decide which distributional framework fits best for each of the series.
Code
# Fit distributions: normal
fit_sp500 <- fitdistr(log_returns_sp500, "normal")
fit_cac40 <- fitdistr(log_returns_cac40, "normal")
fit_nasdaq <- fitdistr(log_returns_nasdaq, "normal")
fit_nikkei <- fitdistr(log_returns_nikkei, "normal")

# Fit a t-distribution to log returns
fit_sp500_t <- fitdistr(log_returns_sp500, "t")
fit_cac40_t <- fitdistr(log_returns_cac40, "t")
fit_nasdaq_t <- fitdistr(log_returns_nasdaq, "t")
fit_nikkei_t <- fitdistr(log_returns_nikkei, "t")

# Calculate AIC for the normal fits
aic_sp500 <- AIC(fit_sp500)
aic_cac40 <- AIC(fit_cac40)
aic_nasdaq <- AIC(fit_nasdaq)
aic_nikkei <- AIC(fit_nikkei)

# Calculate AIC for the t fits
aic_sp500_t <- AIC(fit_sp500_t)
aic_cac40_t <- AIC(fit_cac40_t)
aic_nasdaq_t <- AIC(fit_nasdaq_t)
aic_nikkei_t <- AIC(fit_nikkei_t)

# Create a data frame to store AIC values
aic_data <- data.frame(
  Index = c("SP500", "CAC40", "NASDAQ", "NIKKEI"),
  Normal = c(aic_sp500, aic_cac40, aic_nasdaq, aic_nikkei),
  T_Distribution = c(aic_sp500_t, aic_cac40_t, aic_nasdaq_t, aic_nikkei_t))

# Print the AIC data frame
print(aic_data)
##    Index      Normal T_Distribution
## 1  SP500 -22510.4993    -22944.6649
## 2  CAC40 -14447.5481    -14626.0090
## 3 NASDAQ -13384.9374    -13766.5445
## 4 NIKKEI -14070.5771    -14216.9075

# Create a bar plot of AIC values
barplot(t(as.matrix(aic_data[, -1])),
        beside = TRUE,
        col = c("#1f77b4", "#2ca02c"),
        names.arg = aic_data$Index,
        main = "AIC Comparison of Distributions",
        xlab = "Index",
        ylab = "AIC Value",
        cex.names = 0.8, cex.axis = 0.8, cex.main = 1, cex.lab = 0.9,
        las = 1, border = NA,
        legend.text = c("Normal", "T-Distribution"),
        args.legend = list(x = "bottomright", bty = "n", cex = 0.8))

# Add a grid for better readability
grid(nx = NA, ny = NULL, col = "gray", lty = "dotted", lwd = par("lwd"))
Upon fitting both the Normal and the t-distribution to the data and comparing them with the Akaike Information Criterion (AIC), the t-distribution attains the lower AIC for all four indices, with differences of roughly 150 to 450 points. This indicates that the t-distribution loses less information when representing the underlying structure of the data, chiefly because it captures the heavy tails of the return distributions more effectively.
As a result, we will continue analysing the indices with the Student's t-distribution.
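For reference, the AIC that AIC() reports is just 2k - 2 log L, where k is the number of maximum-likelihood-estimated parameters and log L the maximized log-likelihood. A minimal sketch verifying that identity on a fitdistr() object (any of the fits above would do):

```r
library(MASS)  # for fitdistr(), as used in the document

# Recompute AIC by hand from the fitted object's log-likelihood
manual_aic <- function(fit) {
  k <- length(fit$estimate)          # number of estimated parameters
  as.numeric(2 * k - 2 * logLik(fit))
}

# e.g. manual_aic(fit_sp500_t) should agree with AIC(fit_sp500_t):
# the t-distribution pays a 2-point penalty for its extra df parameter,
# but gains far more in log-likelihood on these heavy-tailed returns.
```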
1.1.7 (g) If this has not been done in (f), fit a t-distribution to the negative log returns using fitdistr().Using a QQ-plot for each of the series, decide whether the fit is better than with a Normal distribution, based on your answer in (d).
Code
# Set up the plotting area for a 4x2 grid
par(mfrow = c(4, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 3, 0))

# Normal distribution QQ plots
qqnorm(log_returns_sp500, main = "Normal QQ-Plot - SP500")
qqline(log_returns_sp500)
qqnorm(log_returns_cac40, main = "Normal QQ-Plot - CAC40")
qqline(log_returns_cac40)
qqnorm(log_returns_nasdaq, main = "Normal QQ-Plot - NASDAQ")
qqline(log_returns_nasdaq)
qqnorm(log_returns_nikkei, main = "Normal QQ-Plot - NIKKEI")
qqline(log_returns_nikkei)

# t-distribution QQ plots
qqplot(qt(ppoints(log_returns_sp500), df = fit_sp500_t$estimate["df"]), log_returns_sp500,
       main = "t-dist QQ-Plot - SP500", xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(0, 1, col = "red")
qqplot(qt(ppoints(log_returns_cac40), df = fit_cac40_t$estimate["df"]), log_returns_cac40,
       main = "t-dist QQ-Plot - CAC40", xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(0, 1, col = "red")
qqplot(qt(ppoints(log_returns_nasdaq), df = fit_nasdaq_t$estimate["df"]), log_returns_nasdaq,
       main = "t-dist QQ-Plot - NASDAQ", xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(0, 1, col = "red")
qqplot(qt(ppoints(log_returns_nikkei), df = fit_nikkei_t$estimate["df"]), log_returns_nikkei,
       main = "t-dist QQ-Plot - NIKKEI", xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(0, 1, col = "red")
Code
# Reset the plotting area
par(mfrow = c(1, 1))
Upon examining the QQ plots, the data points in the t-distribution QQ plots adhere more closely to the reference line, particularly in the tails, suggesting that the t-distribution provides a superior fit for the negative log returns of the indices compared to the normal distribution. This improved alignment, especially in the tails, indicates that the t-distribution more accurately captures the behavior of extreme market movements.
1.2 Part 2: Financial time series, volatility and the random walk hypothesis
1.2.1 (a) Plot the ACF of all the series in Part 1 (i.e. the raw series as well as the negative log returns). What do you observe?
Code
############################## RAW SERIES ##############################
par(mfrow = c(2, 2))      # Create a 2x2 grid of plots
par(oma = c(0, 0, 3, 0))  # Adjust the outer margins to make space for the title

# Plot ACF for the S&P 500 index with adjusted margins
par(mar = c(5, 4, 3, 2))
acf(sp500, main = "ACF of S&P 500 Index")

# Plot ACF for the CAC 40 index with adjusted margins
par(mar = c(5, 4, 3, 2))
acf(cac40, main = "ACF of CAC 40 Index")

# Plot ACF for the NASDAQ index with adjusted margins
par(mar = c(5, 4, 3, 2))
acf(nasdaq, main = "ACF of NASDAQ Index")

# Plot ACF for the Nikkei index with adjusted margins
par(mar = c(5, 4, 3, 2))
acf(nikkei, main = "ACF of Nikkei Index")

# Add the main title
mtext("ACF of raw series", outer = TRUE, cex = 1.5, line = 1)
Code
############################## NEGATIVE LOG RETURNS ##############################
# Setting up the 2x2 plot area with adjusted outer margins for the title
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))

# Plot ACF for the S&P 500 index with adjusted margins
par(mar = c(5, 4, 3, 2))
acf(log_returns_sp500, main = "ACF of S&P 500 Index")

# Plot ACF for the CAC 40 index with adjusted margins
par(mar = c(5, 4, 3, 2))
acf(log_returns_cac40, main = "ACF of CAC 40 Index")

# Plot ACF for the NASDAQ index with adjusted margins
par(mar = c(5, 4, 3, 2))
acf(log_returns_nasdaq, main = "ACF of NASDAQ Index")

# Plot ACF for the Nikkei index with adjusted margins
par(mar = c(5, 4, 3, 2))
acf(log_returns_nikkei, main = "ACF of Nikkei Index")

# Add the central title
mtext("ACF of Negative Log Returns", outer = TRUE, cex = 1.5, line = 0)
The Autocorrelation Function (ACF) plots for both the raw series and the negative log returns reveal distinct patterns. For the raw series, the ACF shows a strong, consistent autocorrelation at all lags, indicating a non-stationary process where past values have a persistent influence on future values.
On the other hand, the ACF plots for the negative log returns show most autocorrelations within the confidence bands, implying no significant autocorrelation at any lag. This suggests that the log returns are closer to a stationary process, despite some lags being outside of the confidence bands, indicating that there may be occasional and minor deviations from strict stationarity.
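For clarity on what these plots actually display: the bar at lag h is the sample autocorrelation r(h), the lag-h autocovariance divided by the sample variance. A hand-rolled version (a sketch, equivalent to what acf() computes) makes the quantity explicit:

```r
# Sample autocorrelation at a single lag h:
# r(h) = sum_t (x_t - xbar)(x_{t+h} - xbar) / sum_t (x_t - xbar)^2
acf_at_lag <- function(x, h) {
  x <- as.numeric(x)
  n <- length(x)
  xbar <- mean(x)
  sum((x[1:(n - h)] - xbar) * (x[(1 + h):n] - xbar)) / sum((x - xbar)^2)
}

# e.g. acf_at_lag(log_returns_sp500, 1) reproduces the lag-1 bar of the plot
```

The dashed confidence bands sit at roughly +/- 1.96/sqrt(n), so bars inside them are statistically indistinguishable from zero.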
1.2.2 (b) Use a Ljung-Box procedure to formally test for (temporal) serial dependence in the series. What is your conclusion?
Code
# For Raw Data
box_test_sp500 <- Box.test(sp500, type = "Ljung-Box")
box_test_cac40 <- Box.test(cac40, type = "Ljung-Box")
box_test_nasdaq <- Box.test(nasdaq, type = "Ljung-Box")
box_test_nikkei <- Box.test(nikkei, type = "Ljung-Box")

# For Negative Log Returns
box_test_log_sp500 <- Box.test(log_returns_sp500, type = "Ljung-Box")
box_test_log_cac40 <- Box.test(log_returns_cac40, type = "Ljung-Box")
box_test_log_nasdaq <- Box.test(log_returns_nasdaq, type = "Ljung-Box")
box_test_log_nikkei <- Box.test(log_returns_nikkei, type = "Ljung-Box")

# Print p-values
cat("Raw SP500 Ljung-Box Test - p-value:", box_test_sp500$p.value, "\n")
## Raw SP500 Ljung-Box Test - p-value: 0
cat("Raw CAC40 Ljung-Box Test - p-value:", box_test_cac40$p.value, "\n")
## Raw CAC40 Ljung-Box Test - p-value: 0
cat("Raw NASDAQ Ljung-Box Test - p-value:", box_test_nasdaq$p.value, "\n")
## Raw NASDAQ Ljung-Box Test - p-value: 0
cat("Raw NIKKEI Ljung-Box Test - p-value:", box_test_nikkei$p.value, "\n")
## Raw NIKKEI Ljung-Box Test - p-value: 0
cat("Log Returns SP500 Ljung-Box Test - p-value:", box_test_log_sp500$p.value, "\n")
## Log Returns SP500 Ljung-Box Test - p-value: 0.932638483
cat("Log Returns CAC40 Ljung-Box Test - p-value:", box_test_log_cac40$p.value, "\n")
## Log Returns CAC40 Ljung-Box Test - p-value: 0.495130124
cat("Log Returns NASDAQ Ljung-Box Test - p-value:", box_test_log_nasdaq$p.value, "\n")
## Log Returns NASDAQ Ljung-Box Test - p-value: 0.409219414
cat("Log Returns NIKKEI Ljung-Box Test - p-value:", box_test_log_nikkei$p.value, "\n")
## Log Returns NIKKEI Ljung-Box Test - p-value: 0.0513979989
For the raw stock market indices, the results indicate strong evidence of serial dependence, with p-values significantly below 0.05 (<2.2e-16). This suggests that there is a temporal correlation in the raw stock market data, indicating that past observations influence future values.
However, when applying the Ljung-Box test to the negative log returns of these indices, the results show no significant serial dependence at the 5% level: the p-values range from 0.0514 to 0.933. This implies that transforming the prices into (negative) log returns removes most of the temporal correlation, bringing the series much closer to a stationary process. The Nikkei index stands out with a p-value of 0.0514, only marginally above the 5% threshold, signalling a higher degree of autocorrelation in its returns than in the other indices.
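The statistic behind Box.test(type = "Ljung-Box") is Q = n(n+2) * sum over h = 1..m of r(h)^2 / (n - h), compared to a chi-square distribution with m degrees of freedom; it aggregates the squared sample autocorrelations from the ACF plots above. A sketch reproducing the p-value by hand:

```r
# Manual Ljung-Box p-value for lags 1..m (no fitted-model df correction,
# matching Box.test(..., type = "Ljung-Box", fitdf = 0))
ljung_box <- function(x, m) {
  x <- as.numeric(x)
  n <- length(x)
  r <- acf(x, lag.max = m, plot = FALSE)$acf[-1]    # drop the lag-0 term (= 1)
  Q <- n * (n + 2) * sum(r^2 / (n - seq_len(m)))    # Ljung-Box statistic
  pchisq(Q, df = m, lower.tail = FALSE)             # upper-tail p-value
}
```

Note that Box.test() defaults to lag = 1, as used above; testing over more lags (e.g. m = 10 or 20) is a common robustness check.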
1.2.3 (c) Propose ARIMA models for each of the negative log returns series, based on visualisation tools (e.g. ACF and PACF). Select an ARIMA model using auto.arima() to each of the negative log returns series. Comment on the difference. Assess the residuals of the resulting models.
Code
########## PACF gives AR --> p is the order of the AR component ##########
par(mfrow = c(2, 2))  # Create a 2x2 grid of plots
par(oma = c(0, 0, 1, 0))
par(mar = c(5, 4, 3, 2))
pacf(log_returns_sp500)
par(mar = c(5, 4, 3, 2))
pacf(log_returns_cac40)
par(mar = c(5, 4, 3, 2))
pacf(log_returns_nasdaq)
par(mar = c(5, 4, 3, 2))
pacf(log_returns_nikkei)

# Add the central title
mtext("PACF of Negative Log Returns", outer = TRUE, cex = 1.5, line = -0.3)
Code
########## ACF gives MA --> q is the order of the MA component ##########
# Setting up the 2x2 plot area with adjusted outer margins for the title
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))

# Plot ACF for the S&P 500 index with adjusted margins
par(mar = c(5, 4, 3, 2))
acf(log_returns_sp500, main = "ACF of S&P 500 Index")

# Plot ACF for the CAC 40 index with adjusted margins
par(mar = c(5, 4, 3, 2))
acf(log_returns_cac40, main = "ACF of CAC 40 Index")

# Plot ACF for the NASDAQ index with adjusted margins
par(mar = c(5, 4, 3, 2))
acf(log_returns_nasdaq, main = "ACF of NASDAQ Index")

# Plot ACF for the Nikkei index with adjusted margins
par(mar = c(5, 4, 3, 2))
acf(log_returns_nikkei, main = "ACF of Nikkei Index")

# Add the central title
mtext("ACF of Negative Log Returns", outer = TRUE, cex = 1.5, line = 0)
Code
########## ADF TEST to see whether differencing is needed ##########
# ADF test for SP500 log returns
adf_sp500 <- adf.test(log_returns_sp500)
print(paste("P-value for SP500:", adf_sp500$p.value))
## [1] "P-value for SP500: 0.01"
# ADF test for CAC40 log returns
adf_cac40 <- adf.test(log_returns_cac40)
print(paste("P-value for CAC40:", adf_cac40$p.value))
## [1] "P-value for CAC40: 0.01"
# ADF test for NASDAQ log returns
adf_nasdaq <- adf.test(log_returns_nasdaq)
print(paste("P-value for NASDAQ:", adf_nasdaq$p.value))
## [1] "P-value for NASDAQ: 0.01"
# ADF test for NIKKEI log returns
adf_nikkei <- adf.test(log_returns_nikkei)
print(paste("P-value for NIKKEI:", adf_nikkei$p.value))
## [1] "P-value for NIKKEI: 0.01"
As each p-value is below 0.05 (adf.test truncates reported p-values at 0.01), the null hypothesis of non-stationarity is rejected for every return series. The series are therefore already stationary and no further differencing is needed, i.e. we can set d = 0 in the ARIMA orders.
Code
############################## CHOOSING OUR OWN ORDER ##############################
# STEP 1: own arima
## ARIMA SP500: order is p, d and q
arima_sp500 <- Arima(log_returns_sp500, order = c(8, 0, 7))  # AIC of -22537
arima_sp500
## Series: log_returns_sp500 
## ARIMA(8,0,7) with non-zero mean 
## 
## Coefficients:
##            ar1        ar2      ar3        ar4       ar5      ar6
##       0.475007  -0.230146  0.35635  -0.066815  0.110447  0.19875
## s.e.  0.101402   0.102107  0.11911   0.113078  0.111588  0.11132
##             ar7       ar8        ma1       ma2        ma3       ma4
##       -0.791668  0.045867  -0.479227  0.215558  -0.377897  0.084015
## s.e.   0.088481  0.020247   0.100359  0.108722   0.121459  0.116628
##             ma5        ma6       ma7       mean
##       -0.154838  -0.175357  0.771575  -0.000318
## s.e.   0.115527   0.112882  0.096961   0.000171
## 
## sigma^2 = 0.000108777:  log likelihood = 11285.55
## AIC=-22537.1   AICc=-22536.93   BIC=-22431.96

## ARIMA CAC40
arima_cac40 <- Arima(log_returns_cac40, order = c(5, 0, 5))  # AIC of -14458
arima_cac40
## Series: log_returns_cac40 
## ARIMA(5,0,5) with non-zero mean 
## 
## Coefficients:
##            ar1       ar2        ar3       ar4        ar5        ma1
##       0.962970  0.287472  -0.347933  0.056554  -0.244987  -0.953959
## s.e.  0.119875  0.178070   0.166782  0.165495        NaN   0.120519
##             ma2       ma3       ma4       ma5       mean
##       -0.323555  0.314450  0.028540  0.212973  -0.000159
## s.e.   0.176153  0.169551  0.166043       NaN   0.000280
## 
## sigma^2 = 0.000212653:  log likelihood = 7241.06
## AIC=-14458.13   AICc=-14458.01   BIC=-14387.88

## ARIMA NASDAQ
arima_nasdaq <- Arima(log_returns_nasdaq, order = c(7, 0, 7))  # AIC of -13412
arima_nasdaq
## Series: log_returns_nasdaq 
## ARIMA(7,0,7) with non-zero mean 
## 
## Coefficients:
##            ar1        ar2       ar3       ar4        ar5       ar6
##       0.244385  -0.247321  0.141274  0.125001  -0.144992  0.031856
## s.e.  0.150974   0.187830  0.218332  0.220995   0.187704  0.156771
##             ar7        ma1       ma2        ma3        ma4       ma5
##       -0.814958  -0.223783  0.206986  -0.124798  -0.131296  0.097960
## s.e.   0.136203   0.147045  0.181585   0.203839   0.202995  0.171941
##             ma6       ma7       mean
##       -0.031123  0.835889  -0.000365
## s.e.   0.149967  0.130248   0.000344
## 
## sigma^2 = 0.00031862:  log likelihood = 6721.84
## AIC=-13411.69   AICc=-13411.48   BIC=-13318.02

## ARIMA NIKKEI
arima_nikkei <- Arima(log_returns_nikkei, order = c(1, 0, 1))  # AIC of -14075
arima_nikkei
## Series: log_returns_nikkei 
## ARIMA(1,0,1) with non-zero mean 
## 
## Coefficients:
##            ar1        ma1      mean
##       0.665909  -0.707699  0.000182
## s.e.  0.189777   0.178365  0.000259
## 
## sigma^2 = 0.000218814:  log likelihood = 7041.36
## AIC=-14074.73   AICc=-14074.71   BIC=-14051.4

# STEP 2: residuals
residuals_sp500 <- residuals(arima_sp500)
residuals_cac40 <- residuals(arima_cac40)
residuals_nasdaq <- residuals(arima_nasdaq)
residuals_nikkei <- residuals(arima_nikkei)

############################## AUTO ARIMA ##############################
# STEP 1: auto arima
plot_sp500_auto <- auto.arima(log_returns_sp500)    # AIC of -22513
plot_cac40_auto <- auto.arima(log_returns_cac40)    # AIC of -14449
plot_nasdaq_auto <- auto.arima(log_returns_nasdaq)  # AIC of -13395
plot_nikkei_auto <- auto.arima(log_returns_nikkei)  # AIC of -14080

# Create comparison plot
par(mfrow = c(2, 2))
plot(residuals_sp500, main = "Residuals - SP500 OWN ARIMA Model")
plot(plot_sp500_auto$residuals, main = "Residuals - SP500 Auto ARIMA Model")
plot(residuals_cac40, main = "Residuals - CAC40 OWN ARIMA Model")
plot(plot_cac40_auto$residuals, main = "Residuals - CAC40 Auto ARIMA Model")
Code
plot(residuals_nasdaq, main = "Residuals - NASDAQ OWN ARIMA Model")
plot(plot_nasdaq_auto$residuals, main = "Residuals - NASDAQ Auto ARIMA Model")
plot(residuals_nikkei, main = "Residuals - NIKKEI OWN ARIMA Model")
plot(plot_nikkei_auto$residuals, main = "Residuals - NIKKEI Auto ARIMA Model")
Code
par(mfrow = c(1, 1))
In terms of AIC, our manually specified models actually edge out auto.arima for the S&P 500 (-22537 vs. -22513), the CAC 40 (-14458 vs. -14449) and the NASDAQ (-13412 vs. -13395), while auto.arima does slightly better for the NIKKEI (-14080 vs. -14075). Visually, the residuals of both sets of models look comparable: they fluctuate around a mean close to 0, which is consistent with independence, and no clear distinction emerges between the two approaches. To make a thorough assessment, we will analyse the residuals using both the Autocorrelation Function (ACF) and the Ljung-Box test.
1.2.4 (d) Assess the residuals of the resulting models from (c), both their raw values and their absolute values, through visual tools (such as the ACF) and formal tests (e.g. Ljung-Box). What do you conclude about the independence assumption?
Code
# Ljung-Box test for autocorrelation in residuals
# Obtain raw residuals of OWN ARIMA
raw_residuals_sp500_own <- residuals(arima_sp500)
raw_residuals_cac40_own <- residuals(arima_cac40)
raw_residuals_nasdaq_own <- residuals(arima_nasdaq)
raw_residuals_nikkei_own <- residuals(arima_nikkei)

# Obtain raw residuals of AUTO ARIMA
raw_residuals_sp500_auto <- residuals(plot_sp500_auto)
raw_residuals_cac40_auto <- residuals(plot_cac40_auto)
raw_residuals_nasdaq_auto <- residuals(plot_nasdaq_auto)
raw_residuals_nikkei_auto <- residuals(plot_nikkei_auto)

# Obtain absolute residuals of OWN ARIMA
absolute_residuals_sp500_own <- abs(raw_residuals_sp500_own)
absolute_residuals_cac40_own <- abs(raw_residuals_cac40_own)
absolute_residuals_nasdaq_own <- abs(raw_residuals_nasdaq_own)
absolute_residuals_nikkei_own <- abs(raw_residuals_nikkei_own)

# Obtain absolute residuals of AUTO ARIMA
absolute_residuals_sp500_auto <- abs(raw_residuals_sp500_auto)
absolute_residuals_cac40_auto <- abs(raw_residuals_cac40_auto)
absolute_residuals_nasdaq_auto <- abs(raw_residuals_nasdaq_auto)
absolute_residuals_nikkei_auto <- abs(raw_residuals_nikkei_auto)

# Plot ACF of raw residuals of OWN ARIMA
par(mfrow = c(2, 2))
Acf(raw_residuals_sp500_own, main = "SP 500")
Acf(raw_residuals_cac40_own, main = "CAC 40")
Acf(raw_residuals_nasdaq_own, main = "NASDAQ")
Acf(raw_residuals_nikkei_own, main = "NIKKEI")
mtext("ACF of Raw Residuals from OWN ARIMA", outer = TRUE, cex = 1, line = -1)
Code
# Plot ACF of raw residuals of AUTO ARIMA
par(mfrow = c(2, 2))
Acf(raw_residuals_sp500_auto, main = "SP 500")
Acf(raw_residuals_cac40_auto, main = "CAC 40")
Acf(raw_residuals_nasdaq_auto, main = "NASDAQ")
Acf(raw_residuals_nikkei_auto, main = "NIKKEI")
mtext("ACF of Raw Residuals from AUTO ARIMA", outer = TRUE, cex = 1, line = -1)
Code
# Plot ACF of absolute residuals of OWN ARIMA
par(mfrow = c(2, 2))
Acf(absolute_residuals_sp500_own, main = "SP 500")
Acf(absolute_residuals_cac40_own, main = "CAC40")
Acf(absolute_residuals_nasdaq_own, main = "NASDAQ")
Acf(absolute_residuals_nikkei_own, main = "NIKKEI")
mtext("ACF of Abs Residuals Log Returns from OWN ARIMA", outer = TRUE, cex = 1, line = -1)
Code
# Plot ACF of absolute residuals of AUTO ARIMA
par(mfrow = c(2, 2))
Acf(absolute_residuals_sp500_auto, main = "SP 500")
Acf(absolute_residuals_cac40_auto, main = "CAC40")
Acf(absolute_residuals_nasdaq_auto, main = "NASDAQ")
Acf(absolute_residuals_nikkei_auto, main = "NIKKEI")
mtext("ACF of Abs Residuals Log Returns from AUTO ARIMA", outer = TRUE, cex = 1, line = -1)
Code
par(mfrow = c(1, 1))

# Ljung-Box test on raw residuals of OWN ARIMA
p_values_raw_own <- sapply(list(
  Box.test(raw_residuals_sp500_own, type = "Ljung-Box"),
  Box.test(raw_residuals_cac40_own, type = "Ljung-Box"),
  Box.test(raw_residuals_nasdaq_own, type = "Ljung-Box"),
  Box.test(raw_residuals_nikkei_own, type = "Ljung-Box")), function(x) x$p.value)

# Ljung-Box test on raw residuals of AUTO ARIMA
p_values_raw_auto <- sapply(list(
  Box.test(raw_residuals_sp500_auto, type = "Ljung-Box"),
  Box.test(raw_residuals_cac40_auto, type = "Ljung-Box"),
  Box.test(raw_residuals_nasdaq_auto, type = "Ljung-Box"),
  Box.test(raw_residuals_nikkei_auto, type = "Ljung-Box")), function(x) x$p.value)

# Ljung-Box test on absolute residuals of OWN ARIMA
p_values_abs_own <- sapply(list(
  Box.test(absolute_residuals_sp500_own, type = "Ljung-Box"),
  Box.test(absolute_residuals_cac40_own, type = "Ljung-Box"),
  Box.test(absolute_residuals_nasdaq_own, type = "Ljung-Box"),
  Box.test(absolute_residuals_nikkei_own, type = "Ljung-Box")), function(x) x$p.value)

# Ljung-Box test on absolute residuals of AUTO ARIMA
p_values_abs_auto <- sapply(list(
  Box.test(absolute_residuals_sp500_auto, type = "Ljung-Box"),
  Box.test(absolute_residuals_cac40_auto, type = "Ljung-Box"),
  Box.test(absolute_residuals_nasdaq_auto, type = "Ljung-Box"),
  Box.test(absolute_residuals_nikkei_auto, type = "Ljung-Box")), function(x) x$p.value)

# Set scipen option to display p-values in full numeric format
options(scipen = 999)

# Create a data frame with p-values and index names.
# Note: P_Value is stacked test-by-test (raw own, raw auto, abs own, abs auto),
# so Index must cycle fastest (times = 4) and Model slowest (each = 4)
# for the labels to line up with the values
p_values_data <- data.frame(
  Index = rep(c("SP500", "CAC40", "NASDAQ", "NIKKEI"), times = 4),
  Model = rep(c("Raw (OWN ARIMA)", "Raw (AUTO ARIMA)",
                "Absolute (OWN ARIMA)", "Absolute (AUTO ARIMA)"), each = 4),
  P_Value = c(p_values_raw_own, p_values_raw_auto, p_values_abs_own, p_values_abs_auto))

# Create a kable with the desired styling
kable_output <- kable(p_values_data, format = "html", escape = FALSE,
                      caption = "P-Values from Ljung-Box Test") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)

# Print the kable
kable_output
P-Values from Ljung-Box Test

Index    Model                     P_Value
SP500    Raw (OWN ARIMA)           0.981090694
SP500    Raw (AUTO ARIMA)          0.929812388
SP500    Absolute (OWN ARIMA)      0.661245435
SP500    Absolute (AUTO ARIMA)     0.975504647
CAC40    Raw (OWN ARIMA)           0.983060220
CAC40    Raw (AUTO ARIMA)          0.495130124
CAC40    Absolute (OWN ARIMA)      0.493721726
CAC40    Absolute (AUTO ARIMA)     0.992305513
NASDAQ   Raw (OWN ARIMA)           0.000000000
NASDAQ   Raw (AUTO ARIMA)          0.000000000
NASDAQ   Absolute (OWN ARIMA)      0.000000000
NASDAQ   Absolute (AUTO ARIMA)     0.000373729
NIKKEI   Raw (OWN ARIMA)           0.000000000
NIKKEI   Raw (AUTO ARIMA)          0.000000000
NIKKEI   Absolute (OWN ARIMA)      0.000000000
NIKKEI   Absolute (AUTO ARIMA)     0.000760171
The ACF plots and Ljung-Box test results for the ARIMA model residuals, both raw and absolute, raise concerns about the independence of these residuals across the S&P 500, CAC 40, NASDAQ, and Nikkei indices. The ACF plots of the raw residuals show significant autocorrelation at various lags for some indices, suggesting residual dependence. The plots of the absolute residuals underscore this further, showing pronounced autocorrelation, especially for the NASDAQ index, which points to volatility clustering.
A comparison of the AUTO ARIMA and OWN ARIMA models reveals somewhat stronger autocorrelation in the AUTO ARIMA residuals, suggesting that the OWN ARIMA captures the series dynamics slightly better. However, the near-zero p-values of the Ljung-Box tests on both the raw and the absolute residuals of the NASDAQ and the NIKKEI decisively reject the null hypothesis of independence, confirming that significant autocorrelation persists for these two indices.
These findings suggest that even when the raw residuals show no autocorrelation, their absolute values may, indicating non-constant variance and casting doubt on the adequacy of the models. The implication is clear: a model that accommodates conditional variance, such as GARCH, is warranted for a more faithful representation of these financial time series, given the heteroscedasticity revealed by the tests.
1.2.5 (e) Plot the volatility of the raw series of indices. What is your conclusion on the homoscedasticity assumption?
Code
# Compute the volatility of each raw index series
volatility_sp500 <- volatility(as.vector(sp500.df$SP500))
volatility_cac40 <- volatility(as.vector(cac40.df$CAC40))
volatility_nasdaq <- volatility(as.vector(nasdaq.df$NASDAQ))
volatility_nikkei <- volatility(as.vector(nikkei.df$NIKKEI))

# Create new data frames for each index that combine dates and volatility
sp500_data <- data.frame(Date = sp500.df$DATE, Volatility = volatility_sp500)
cac40_data <- data.frame(Date = cac40.df$DATE, Volatility = volatility_cac40)
nasdaq_data <- data.frame(Date = nasdaq.df$DATE, Volatility = volatility_nasdaq)
nikkei_data <- data.frame(Date = nikkei.df$DATE, Volatility = volatility_nikkei)

# Now plot using the new data frames for the x-axis and y-axis
p <- plot_ly() %>%
  add_trace(data = sp500_data, x = ~Date, y = ~Volatility, name = 'S&P 500',
            type = 'scatter', mode = 'lines', line = list(color = '#1f77b4')) %>%
  add_trace(data = cac40_data, x = ~Date, y = ~Volatility, name = 'CAC 40',
            type = 'scatter', mode = 'lines', line = list(color = '#ff7f0e'),
            visible = FALSE) %>%
  add_trace(data = nasdaq_data, x = ~Date, y = ~Volatility, name = 'NASDAQ',
            type = 'scatter', mode = 'lines', line = list(color = '#2ca02c'),
            visible = FALSE) %>%
  add_trace(data = nikkei_data, x = ~Date, y = ~Volatility, name = 'Nikkei',
            type = 'scatter', mode = 'lines', line = list(color = '#d62728'),
            visible = FALSE) %>%
  layout(
    title = "Volatility of Major Stock Indices",
    xaxis = list(title = "Date", type = 'date'),  # ensure the x-axis is treated as dates
    yaxis = list(title = "Volatility"),
    updatemenus = list(list(
      type = "buttons",
      active = -1,
      buttons = list(
        list(label = "S&P 500", method = "update",
             args = list(list(visible = c(TRUE, FALSE, FALSE, FALSE)),
                         list(title = "Volatility of the Raw S&P 500"))),
        list(label = "CAC 40", method = "update",
             args = list(list(visible = c(FALSE, TRUE, FALSE, FALSE)),
                         list(title = "Volatility of the Raw CAC 40"))),
        list(label = "NASDAQ", method = "update",
             args = list(list(visible = c(FALSE, FALSE, TRUE, FALSE)),
                         list(title = "Volatility of the Raw NASDAQ"))),
        list(label = "Nikkei", method = "update",
             args = list(list(visible = c(FALSE, FALSE, FALSE, TRUE)),
                         list(title = "Volatility of the Raw NIKKEI")))))))

# Display the plot
p
The volatility graphs show periods with significant spikes in volatility (e.g., the NASDAQ between 1999 and 2000), which challenges the assumption of homoscedasticity. This assumption, namely that the variance of the error term remains constant throughout the time series, is a foundational element of the Black-Scholes model. The presence of heteroscedasticity, where variance fluctuates over time, undermines the model's premises, since Black-Scholes requires stable volatility to accurately price options.
Therefore, we will fit a GARCH model that addresses the issue of heteroscedasticity of the residuals.
1.2.6 (f) Fit GARCH models to the negative log returns of each series with both standardised and skewed t-distributions, with order (1; 1), using the garchFit() function from the fGarch library. Assess the quality of the fit by evaluating the residuals.
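The garchFit() calls themselves are not shown below, only their diagnostics. A minimal sketch of the fitting step that produces the residuals analysed in this section might look as follows; the vector `x` is a simulated stand-in for one series of negative log returns (the actual series from the data-preparation step would be used instead).

```r
library(fGarch)

# Simulated stand-in for a series of negative log returns
set.seed(42)
x <- rnorm(1000, sd = 0.01)

# GARCH(1,1) with a standardized t-distribution ("std")
# and with a skewed t-distribution ("sstd")
fit_std  <- garchFit(~ garch(1, 1), data = x, cond.dist = "std",  trace = FALSE)
fit_sstd <- garchFit(~ garch(1, 1), data = x, cond.dist = "sstd", trace = FALSE)

# Standardized residuals, as used in the ACF plots and Ljung-Box tests below
res_std  <- residuals(fit_std,  standardize = TRUE)
res_sstd <- residuals(fit_sstd, standardize = TRUE)
```

The same pattern is repeated for each index, giving the `residuals_*_std` and `residuals_*_sstd` objects used below.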
Code
par(mfrow = c(2, 2))
acf(residuals_sp500_std^2, main = "SP500")
acf(residuals_cac40_std^2, main = "CAC40")
acf(residuals_NASDAQ_std^2, main = "NASDAQ")
acf(residuals_nikkei_std^2, main = "NIKKEI")
mtext("ACF of Squared Residuals with standardized t-distribution",
      outer = TRUE, cex = 1, line = -1)
Code
# For skewed distribution
par(mfrow = c(2, 2))
plot(residuals_sp500_sstd, main = "Residuals of SP500 GARCH Model with sstd t-dist")
plot(residuals_cac40_sstd, main = "Residuals of CAC40 GARCH Model with sstd t-dist")
plot(residuals_NASDAQ_sstd, main = "Residuals of NASDAQ GARCH Model with sstd t-dist")
plot(residuals_nikkei_sstd, main = "Residuals of NIKKEI GARCH Model with sstd t-dist")
mtext("Residuals of GARCH Model with skewed t-distribution", outer = TRUE, cex = 1, line = -1)
Code
par(mfrow = c(2, 2))
acf(residuals_sp500_sstd^2, main = "SP500")
acf(residuals_cac40_sstd^2, main = "CAC40")
acf(residuals_NASDAQ_sstd^2, main = "NASDAQ")
acf(residuals_nikkei_sstd^2, main = "NIKKEI")
mtext("ACF of Squared Residuals with skewed t-distribution", outer = TRUE, cex = 1, line = -1)
Code
# Perform Ljung-Box tests and collect the p-values
p_values_squared_residuals <- c(
  Box.test(residuals_sp500_std^2, lag = 20, type = "Ljung-Box")$p.value,
  Box.test(residuals_sp500_sstd^2, lag = 20, type = "Ljung-Box")$p.value,
  Box.test(residuals_cac40_std^2, lag = 20, type = "Ljung-Box")$p.value,
  Box.test(residuals_cac40_sstd^2, lag = 20, type = "Ljung-Box")$p.value,
  Box.test(residuals_NASDAQ_std^2, lag = 20, type = "Ljung-Box")$p.value,
  Box.test(residuals_NASDAQ_sstd^2, lag = 20, type = "Ljung-Box")$p.value,
  Box.test(residuals_nikkei_std^2, lag = 20, type = "Ljung-Box")$p.value,
  Box.test(residuals_nikkei_sstd^2, lag = 20, type = "Ljung-Box")$p.value)

# Create a data frame with the results
ljung_box_results <- data.frame(
  Index = rep(c("SP500", "CAC40", "NASDAQ", "NIKKEI"), each = 2),
  Distribution = rep(c("STD", "SSTD"), times = 4),
  P_Value = p_values_squared_residuals)

# Create a kable
kable_output <- kable(ljung_box_results, format = "html",
                      caption = "Ljung-Box Test Results on Squared Residuals") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

# Print the kable to the R console (note: this will not render as HTML in the console)
kable_output
Ljung-Box Test Results on Squared Residuals

Index    Distribution    P_Value
SP500    STD             0.751176895
SP500    SSTD            0.767092017
CAC40    STD             0.784395624
CAC40    SSTD            0.800149964
NASDAQ   STD             0.277930731
NASDAQ   SSTD            0.313847177
NIKKEI   STD             0.996172310
NIKKEI   SSTD            0.996168168
H0: the data are independently distributed, i.e. there is no autocorrelation among the squared residuals. H1: the data exhibit some level of autocorrelation, meaning that past values in the series have a statistically significant influence on future values.
The quality of the fit is good: with p-values well above 0.05 for every index and both distributions, the null hypothesis cannot be rejected, implying that no significant autocorrelation remains in the squared residuals of the GARCH(1,1) models.
Code
# Determining if the skewed t-distribution is better than the standardized t-distribution
boxplot(residuals_sp500_std - residuals_sp500_sstd)
As this boxplot shows, the residuals under the standardized and skewed t-distributions differ only slightly; the difference is so small that, even if the skewness parameter were significant, it would not change the results in a meaningful way. We therefore proceed with the simpler model, the standardized t-distribution.
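The choice between the two distributions can also be made formally through the information criteria that fGarch stores in the fitted object's @fit$ics slot (AIC, BIC, SIC, HQIC). This is an illustrative sketch on simulated stand-in data, not the actual index series:

```r
library(fGarch)

# Simulated stand-in for a series of negative log returns
set.seed(42)
x <- rnorm(1000, sd = 0.01)

fit_std  <- garchFit(~ garch(1, 1), data = x, cond.dist = "std",  trace = FALSE)
fit_sstd <- garchFit(~ garch(1, 1), data = x, cond.dist = "sstd", trace = FALSE)

# Information criteria (AIC, BIC, SIC, HQIC); lower is better
fit_std@fit$ics
fit_sstd@fit$ics

# Estimated skew parameter of the sstd fit; values close to 1
# indicate little skewness, supporting the simpler std model
coef(fit_sstd)["skew"]
```

If the criteria are essentially tied and the skew estimate is near 1, the simpler standardized t-distribution is the natural choice, consistent with the boxplot comparison above.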
1.2.7 (g) Residual serial correlation can be present when fitting a GARCH directly on the negative log returns.
Code
# Step 3: Assess the quality of the fit
residuals_sp500_garch <- residuals(garch_sp500_std, standardize = TRUE)
residuals_cac40_garch <- residuals(garch_cac40_std, standardize = TRUE)
residuals_NASDAQ_garch <- residuals(garch_NASDAQ_std, standardize = TRUE)
residuals_nikkei_garch <- residuals(garch_nikkei_std, standardize = TRUE)

par(mfrow = c(2, 2))
plot(residuals_sp500_garch, main = "SP500 residuals with std t-dist")
plot(residuals_cac40_garch, main = "CAC40 residuals with std t-dist")
plot(residuals_NASDAQ_garch, main = "NASDAQ residuals with std t-dist")
plot(residuals_nikkei_garch, main = "NIKKEI residuals with std t-dist")
Code
Box.test(residuals_sp500_garch, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_sp500_garch
## X-squared = 29.20505, df = 20, p-value = 0.0837997

Box.test(residuals_cac40_garch, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_cac40_garch
## X-squared = 22.33086, df = 20, p-value = 0.32283

Box.test(residuals_NASDAQ_garch, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_NASDAQ_garch
## X-squared = 27.29709, df = 20, p-value = 0.127111

Box.test(residuals_nikkei_garch, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_nikkei_garch
## X-squared = 23.63077, df = 20, p-value = 0.258893

Box.test(residuals_sp500_garch^2, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_sp500_garch^2
## X-squared = 16.12304, df = 20, p-value = 0.708962

Box.test(residuals_cac40_garch^2, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_cac40_garch^2
## X-squared = 14.8594, df = 20, p-value = 0.784396

Box.test(residuals_NASDAQ_garch^2, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_NASDAQ_garch^2
## X-squared = 23.44418, df = 20, p-value = 0.267514

Box.test(residuals_nikkei_garch^2, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_nikkei_garch^2
## X-squared = 7.058748, df = 20, p-value = 0.996488
As the results of the Ljung-Box tests on both the residuals and the squared residuals show, we cannot reject the null hypothesis of no autocorrelation in the residuals of the GARCH(1,1) model. As a result, we have a good-quality fit with a GARCH(1,1), though a somewhat weaker one in the case of the SP500 (p ≈ 0.084).
1.2.8 (h) Use the garchAuto.R script in order to fit a GARCH on the residuals of the ARMA(p,q) from (g). Assess the quality of the fit.
Code
# Auto GARCH formula
garchAutoTryFit = function(ll, data, trace = FALSE, forecast.length = 1,
                           with.forecast = TRUE, ic = "AIC", garch.model = "garch") {
  formula = as.formula(paste(sep = "",
    "~ arma(", ll$order[1], ",", ll$order[2], ")+",
    garch.model, "(", ll$order[3], ",", ll$order[4], ")"))
  fit = tryCatch(garchFit(formula = formula, data = data,
                          trace = FALSE, cond.dist = ll$dist),
                 error = function(err) TRUE,
                 warning = function(warn) FALSE)
  pp = NULL
  if (!is.logical(fit)) {
    if (with.forecast) {
      pp = tryCatch(predict(fit, n.ahead = forecast.length, doplot = FALSE),
                    error = function(err) FALSE,
                    warning = function(warn) FALSE)
      if (is.logical(pp)) {
        fit = NULL
      }
    }
  } else {
    fit = NULL
  }
  if (trace) {
    if (is.null(fit)) {
      cat(paste(sep = "", "  Analyzing (", ll$order[1], ",", ll$order[2],
                ",", ll$order[3], ",", ll$order[4], ") with ", ll$dist,
                " distribution done. ", "Bad model.\n"))
    } else {
      if (with.forecast) {
        cat(paste(sep = "", "  Analyzing (", ll$order[1], ",", ll$order[2],
                  ",", ll$order[3], ",", ll$order[4], ") with ", ll$dist,
                  " distribution done. ", "Good model. ", ic, " = ",
                  round(fit@fit$ics[[ic]], 6), ", forecast: ",
                  paste(collapse = ",", round(pp[, 1], 4)), "\n"))
      } else {
        cat(paste(sep = "", "  Analyzing (", ll[1], ",", ll[2], ",", ll[3],
                  ",", ll[4], ") with ", ll$dist, " distribution done. ",
                  "Good model. ", ic, " = ",
                  round(fit@fit$ics[[ic]], 6), "\n"))
      }
    }
  }
  return(fit)
}

garchAuto = function(xx, min.order = c(0, 0, 1, 1), max.order = c(5, 5, 1, 1),
                     trace = FALSE, cond.dists = "sged", with.forecast = TRUE,
                     forecast.length = 1, arma.sum = c(0, 1e9), cores = 1,
                     ic = "AIC", garch.model = "garch") {
  require(fGarch)
  require(parallel)

  len = NROW(xx)
  models = list()
  for (dist in cond.dists)
    for (p in min.order[1]:max.order[1])
      for (q in min.order[2]:max.order[2])
        for (r in min.order[3]:max.order[3])
          for (s in min.order[4]:max.order[4]) {
            pq.sum = p + q
            if (pq.sum <= arma.sum[2] && pq.sum >= arma.sum[1]) {
              models[[length(models) + 1]] = list(order = c(p, q, r, s), dist = dist)
            }
          }

  res = mclapply(models, garchAutoTryFit, data = xx, trace = trace, ic = ic,
                 garch.model = garch.model, forecast.length = forecast.length,
                 with.forecast = TRUE, mc.cores = cores)

  best.fit = NULL
  best.ic = 1e9
  for (rr in res) {
    if (!is.null(rr)) {
      current.ic = rr@fit$ics[[ic]]
      if (current.ic < best.ic) {
        best.ic = current.ic
        best.fit = rr
      }
    }
  }
  if (best.ic < 1e9) {
    return(best.fit)
  }
  return(NULL)
}
Code
# Fit the AutoGarch on the ARMA residuals
autogarch_sp500_std <- garchAuto(arma_sp500_residuals)
## Loading required package: parallel
autogarch_cac40_std <- garchAuto(arma_cac40_residuals)
autogarch_NASDAQ_std <- garchAuto(arma_NASDAQ_residuals)
autogarch_nikkei_std <- garchAuto(arma_nikkei_residuals)

# Get the standardized residuals of the AutoGarch
residuals_sp500_autogarch <- residuals(autogarch_sp500_std, standardize = TRUE)
residuals_cac40_autogarch <- residuals(autogarch_cac40_std, standardize = TRUE)
residuals_NASDAQ_autogarch <- residuals(autogarch_NASDAQ_std, standardize = TRUE)
residuals_nikkei_autogarch <- residuals(autogarch_nikkei_std, standardize = TRUE)

par(mfrow = c(2, 2))
plot(residuals_sp500_autogarch, main = "SP500 residuals with std t-dist")
plot(residuals_cac40_autogarch, main = "CAC40 residuals with std t-dist")
plot(residuals_NASDAQ_autogarch, main = "NASDAQ residuals with std t-dist")
plot(residuals_nikkei_autogarch, main = "NIKKEI residuals with std t-dist")
Code
# Assess the quality of the fit
Box.test(residuals_sp500_autogarch, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_sp500_autogarch
## X-squared = 32.82426, df = 20, p-value = 0.0352692

Box.test(residuals_cac40_autogarch, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_cac40_autogarch
## X-squared = 19.67275, df = 20, p-value = 0.478561

Box.test(residuals_NASDAQ_autogarch, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_NASDAQ_autogarch
## X-squared = 27.86324, df = 20, p-value = 0.112676

Box.test(residuals_nikkei_autogarch, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_nikkei_autogarch
## X-squared = 17.4285, df = 20, p-value = 0.624999

Box.test(residuals_sp500_autogarch^2, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_sp500_autogarch^2
## X-squared = 15.63696, df = 20, p-value = 0.738876

Box.test(residuals_cac40_autogarch^2, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_cac40_autogarch^2
## X-squared = 16.42077, df = 20, p-value = 0.690201

Box.test(residuals_NASDAQ_autogarch^2, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_NASDAQ_autogarch^2
## X-squared = 22.03125, df = 20, p-value = 0.338818

Box.test(residuals_nikkei_autogarch^2, lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals_nikkei_autogarch^2
## X-squared = 7.595681, df = 20, p-value = 0.994223
The Ljung-Box tests on the residuals of the garchAuto fits also give non-significant p-values, except for the SP500. This means that we cannot reject the null hypothesis of no autocorrelation in the residuals for the CAC40, NASDAQ and Nikkei, but we do reject it at the 5% significance level for the SP500 (p ≈ 0.035). However, the squared residuals of the SP500 show a p-value high enough for the null hypothesis not to be rejected.
We can conclude that the SP500 could still exhibit autocorrelation in its residuals, depending on which Ljung-Box test we place more weight on (raw residuals or squared residuals).
2 Practical 2
2.1 Part 1: Venice
2.1.1 (a) Read in the data. Extract and represent the yearly max values from 1940 to 2009. What do you observe ?
Code
# Read the data
venice <- venice90

# Using a block maxima approach to get the maximum sea level per year
venice_max <- venice %>%
  group_by(year) %>%
  summarise(max_sea_level = max(sealevel))

head(venice_max)
## # A tibble: 6 x 2
##    year max_sea_level
##   <int>         <dbl>
## 1  1940           101
## 2  1941            99
## 3  1942            93
## 4  1943            96
## 5  1944           106
## 6  1945           104

plot_ly(venice_max, x = ~year, y = ~max_sea_level, type = 'scatter',
        mode = 'markers', name = 'Max Value') %>%
  layout(title = "Maximum Value per year",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Maximum Value")) %>%
  add_segments(x = 1940, xend = 2009, y = 140, yend = 140,
               line = list(color = 'red', width = 2))
From the plot, we can discern a total of 11 data points that surpass the 140 cm threshold, marked by the red line. The highest recorded sea level, in 1966, stands out as the peak value in the dataset at 192 cm. Looking at the distribution of the yearly maximum values, there is a hint of a potential trend in the data. Whether the extreme values are stationary or not will be investigated in (d) using a likelihood ratio test.
2.1.2 (b) We are end of 2009 and would like to predict the yearly maximum values over the next 13 years (from 2010 to 2022). A naive approach consists of fitting a linear model on the observed yearly maxima and predict their values for 2010-2022. Proceed to this prediction and provide confidence intervals.
Code
# Create linear model
mod1 <- lm(max_sea_level ~ year, data = venice_max)
summary(mod1)
## 
## Call:
## lm(formula = max_sea_level ~ year, data = venice_max)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max
## -26.49259 -11.48509  -2.37432   9.97635  71.86521
## 
## Coefficients:
##                 Estimate  Std. Error t value Pr(>|t|)
## (Intercept) -430.228256  200.361708 -2.14726 0.035341 *
## year           0.279941    0.101469  2.75887 0.007445 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.1533 on 68 degrees of freedom
## Multiple R-squared: 0.100664,   Adjusted R-squared: 0.0874386
## F-statistic: 7.61135 on 1 and 68 DF,  p-value: 0.00744505

# Predictions for the next 13 years using the linear model
mod1_predict <- predict.lm(mod1, newdata = data.frame("year" = c(2010:2022)),
                           se = TRUE, interval = "confidence", level = 0.95)
mod1_predict
## $fit
##           fit        lwr        upr
## 1  132.452174 124.181461 140.722886
## 2  132.732114 124.284836 141.179393
## 3  133.012055 124.387071 141.637039
## 4  133.291995 124.488236 142.095755
## 5  133.571936 124.588395 142.555476
## 6  133.851876 124.687608 143.016145
## 7  134.131817 124.785929 143.477705
## 8  134.411758 124.883408 143.940107
## 9  134.691698 124.980095 144.403301
## 10 134.971639 125.076031 144.867246
## 11 135.251579 125.171260 145.331898
## 12 135.531520 125.265818 145.797221
## 13 135.811460 125.359742 146.263178
## 
## $se.fit
##          1          2          3          4          5          6
## 4.14474633 4.23322998 4.32228437 4.41187491 4.50196961 4.59253880
##          7          8          9         10         11         12
## 4.68355494 4.77499248 4.86682767 4.95903841 5.05160415 5.14450571
##         13
## 5.23772523
## 
## $df
## [1] 68
## 
## $residual.scale
## [1] 17.1532717
As anticipated, this simple approach leads to poor predictions, with confidence intervals spanning from 124 to 146 cm. The main reason for this lackluster performance is the violation of key assumptions of linear modeling when dealing with extreme values: linearity, homoscedasticity and robustness to outliers. Consequently, we turn to specialized models rooted in extreme value theory to address these challenges more effectively.
2.1.3 (c) Represent in the same graph the predicted yearly max values for the period 2010-2022, their pointwise confidence bounds and the observed values greater than 140 cm from the table below.
Code
# Create a new data frame for the extreme values of 2010-2022 (table from Wikipedia)
max_real <- data.frame(
  year = c(2012, 2012, 2013, 2018, 2019, 2019, 2019, 2022),
  max_sea_level = c(143, 149, 143, 156, 187, 144, 154, 204))

# Create the ggplot object
venice_plot <- ggplot() +
  geom_point(data = max_real,
             aes(x = year, y = max_sea_level, color = "Observed"),
             alpha = 0.5, show.legend = TRUE) +
  geom_point(data = venice_max_predict,
             aes(x = year, y = PredictedValues.fit.fit, color = "Predicted"),
             shape = 1, show.legend = TRUE) +
  labs(title = "Predicted Yearly Max Values vs Observed Values (>140cm)",
       x = "Year", y = "Sea Level") +
  scale_x_continuous(breaks = unique(c(venice_max_predict$year, max_real$year))) +
  scale_color_manual(name = "Data Type",
                     values = c("Observed" = "red", "Predicted" = "black")) +
  theme(legend.title = element_blank())

# Convert the ggplot object to a Plotly interactive plot
interactive_venice_plot <- ggplotly(venice_plot)

# Add the confidence interval as a separate trace
interactive_venice_plot <- interactive_venice_plot %>%
  add_ribbons(data = venice_max_predict, x = ~year,
              ymin = ~PredictedValues.fit.lwr, ymax = ~PredictedValues.fit.upr,
              color = I("blue"), showlegend = TRUE, name = "Confidence Interval")

# Display the interactive plot
interactive_venice_plot
As previously noted and shown in the plot, the predictions of extreme values from the linear model exhibit significant shortcomings. The predicted values follow a linear pattern and consistently fall short of capturing extreme events. There is one exception, the observation of 13 November 2019, for which the confidence interval of the predicted value aligns reasonably well with the actual value. Nonetheless, even in this case the model underestimates the observed value by approximately 9 cm, a substantial deviation.
To address these limitations, we are turning to the Generalized Extreme Value (GEV) distribution, a more suitable approach for modeling extreme events.
2.1.4 (d) Fit a GEV a with constant parameters to the historical yearly max values. Fit a GEV with time varying location parameter. Compare the two embedded models using likelihood ratio test (LRT). Show diagnostic plots.
Code
########################################
# extRemes
########################################

# First we unlist our data frame
list_max_sea_levels <- unlist(venice_max$max_sea_level)

# GEV model with fixed location
gev_fix <- fevd(list_max_sea_levels, type = "GEV", time.units = "year")
plot(gev_fix) ; gev_fix$results$par ; return.level(gev_fix)  # shape is almost equal to 0
##        location           scale           shape
##  114.5629627966   14.5669041620   -0.0328669377
## fevd(x = list_max_sea_levels, type = "GEV", time.units = "year")
## get(paste("return.level.fevd.", newcl, sep = ""))(x = x, return.period = return.period)
##
## GEV model fitted to list_max_sea_levels
## Data are assumed to be stationary
## [1] "Return Levels for period units in years"
##   2-year level  20-year level 100-year level
##     119.869893     155.784722     176.753120

# Compute confidence interval
ci(gev_fix, type = "parameter")  # CI includes 0
## fevd(x = list_max_sea_levels, type = "GEV", time.units = "year")
##
## [1] "Normal Approx."
##
##            95% lower CI        Estimate   95% upper CI
## location  110.724115809  114.5629627966  118.401809784
## scale      11.785048780   14.5669041620   17.348759544
## shape      -0.200251569   -0.0328669377    0.134517694

# GEV models with time-varying location, using both linear and harmonic functions
yy <- 1:length(venice_max$year)
gev_time_varying_linear <- fevd(venice_max$max_sea_level, venice_max, type = "GEV",
                                time.units = "year", location.fun = ~ yy)
gev_time_varying_harmonic <- fevd(venice_max$max_sea_level, venice_max, type = "GEV",
                                  time.units = "year",
                                  location.fun = ~ sin(2 * pi * (year - 1940) / 70) +
                                                  cos(2 * pi * (year - 1940) / 70))
plot(gev_time_varying_linear) ; plot(gev_time_varying_harmonic)
# gev_time_varying_linear$results$par ; gev_time_varying_harmonic$results$par
# return.level(gev_time_varying_linear) ; return.level(gev_time_varying_harmonic)
Code
# Compare the two models using a likelihood ratio test.
# H0: no significant difference in model fit.
# H1: there is a significant difference in model fit between the two models.
lrt_result_linear <- lr.test(gev_fix, gev_time_varying_linear)
# we can almost reject the null hypothesis at the 95% level, which makes sense
# as we have indications that the distribution is non-stationary
lrt_result_harmonic <- lr.test(gev_fix, gev_time_varying_harmonic)

attributes(gev_time_varying_linear)
## $names
##  [1] "call"            "data.name"       "weights"
##  [4] "in.data"         "missing.values"  "x"
##  [7] "cov.data"        "method"          "type"
## [10] "period.basis"    "par.models"      "const.loc"
## [13] "const.scale"     "const.shape"     "n"
## [16] "na.action"       "parnames"        "results"
## [19] "initial.results"
## 
## $class
## [1] "fevd"

# AIC: 589.5372 linear   --> sufficient, as it has the lower AIC
# AIC: 594.9147 harmonic --> too complex; we stick with the linear (time-varying location)
We initially observed that the shape parameter of the fixed Generalized Extreme Value (GEV) model was close to 0. This suggested that the extreme values closely follow a Gumbel distribution, which is characterized by an exponential tail. To confirm this hypothesis, we calculated a confidence interval using the ci function in the extRemes package. The resulting confidence interval included 0, indicating that the shape parameter was not significantly different from 0.
Next, we aimed to fit a GEV model with a time-varying location parameter. We explored two functions: linear and harmonic. The linear specification treats "Year" as a linear predictor, so the location parameter changes linearly over time. However, the likelihood ratio test indicated that the fixed (stationary) GEV model was a better fit. This made sense: although there are some patterns in the maximum values, a linear trend does not capture the full variability of the data.
On the other hand, the harmonic function, which is often used to model periodic phenomena, yielded a different outcome. It allowed us to reject the null hypothesis of the likelihood ratio test, indicating that the GEV model with time-varying location parameters was a better fit. This result suggested that our maximum values displayed non-stationary behavior.
2.1.5 (e) Add if necessary a time varying scale and or shape GEV parameter. Select the best model according to LRT.
Code
# Fit GEV models with time-varying scale and/or shape
gev_time_varying_scale <- fevd(venice_max$max_sea_level, venice_max, type = "GEV",
                               time.units = "year", scale.fun = ~ yy)  # AIC: 600
gev_time_varying_shape <- fevd(venice_max$max_sea_level, venice_max, type = "GEV",
                               time.units = "year", shape.fun = ~ yy)  # AIC: 600.7072
gev_time_varying_scale_shape <- fevd(venice_max$max_sea_level, venice_max, type = "GEV",
                                     time.units = "year", shape.fun = ~ yy,
                                     scale.fun = ~ yy)  # AIC: 601.9
gev_time_varying_location_scale <- fevd(venice_max$max_sea_level, venice_max, type = "GEV",
                                        time.units = "year", scale.fun = ~ yy,
                                        location.fun = ~ yy)  # AIC: 590.6
gev_time_varying_location_shape <- fevd(venice_max$max_sea_level, venice_max, type = "GEV",
                                        time.units = "year", shape.fun = ~ yy,
                                        location.fun = ~ yy)  # AIC: 586
gev_time_varying_scale_shape_location <- fevd(venice_max$max_sea_level, venice_max, type = "GEV",
                                              time.units = "year", shape.fun = ~ yy,
                                              location.fun = ~ yy, scale.fun = ~ yy)  # AIC: 586

# We cannot reject the null hypothesis, so we choose the simpler
# gev_time_varying_location_shape model
lrt_result <- lr.test(gev_time_varying_location_shape, gev_time_varying_scale_shape_location)
After comparing the AICs of the models above, it appears that the best models are gev_time_varying_location_shape and gev_time_varying_scale_shape_location, on which we performed the LRT. The test shows that the best model is the less complex one: gev_time_varying_location_shape.
2.1.6 (f) Predict the 13-years return level, each year from 2010 to 2022.
The obtained return levels are quite stable, slowly increasing over time from 148 to 149.
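A sketch of how such non-stationary return levels can be computed with extRemes follows, using `make.qcov` to supply the covariate values for 2010-2022 (here `yy` runs over 1:70 for 1940-2009, so the next 13 years correspond to 71:83). Simulated maxima stand in for `venice_max$max_sea_level`, and the covariate names `mu1`/`xi1` assume a model with location and shape linear in `yy`, as selected in (e):

```r
library(extRemes)

# Simulated stand-in for 70 yearly maxima with a mild upward trend
set.seed(1)
yy <- 1:70
z  <- 110 + 0.3 * yy + 15 * revd(70)   # revd(): draw GEV-distributed noise

fit <- fevd(z, data.frame(yy = yy), type = "GEV",
            location.fun = ~ yy, shape.fun = ~ yy)

# Covariate values for 2010-2022 correspond to yy = 71:83
qcov_future <- make.qcov(fit, vals = list(mu1 = 71:83, xi1 = 71:83))

# 13-year return level for each year 2010-2022
return.level(fit, return.period = 13, qcov = qcov_future)
```

Applied to the actual fitted model, this yields one 13-year return level per future year, which is what the slowly increasing 148-149 sequence above reports.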
2.1.7 (g) Calculate confidence bands for these predictions
Here, we can observe a trend slowly increasing from the predicted values. However, there is a noticeable problem with the method we used. Indeed, the confidence intervals are way too close to the return level, giving us bounds which are too narrow to observe anything significant. A more complete analysis would require a de-trending of the bootstrapped values, but we were limited by the time allowed to complete this work.
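An illustrative sketch of the kind of bootstrap that could produce such bands: resample the yearly maxima with replacement, refit the GEV, and recompute the 13-year return level. Note that plain resampling discards the trend, which is precisely the de-trending limitation mentioned above, so a stationary fit is used here and simulated maxima stand in for the real series:

```r
library(extRemes)

# Simulated stand-in for the 70 yearly maxima
set.seed(1)
z <- 110 + 15 * revd(70)

# Nonparametric bootstrap of the 13-year return level
boot_rl <- replicate(200, {
  zb  <- sample(z, replace = TRUE)
  fit <- tryCatch(fevd(zb, type = "GEV"), error = function(e) NULL)
  if (is.null(fit)) NA_real_ else return.level(fit, return.period = 13)[1]
})

# Pointwise 95% band for the 13-year return level
quantile(boot_rl, c(0.025, 0.975), na.rm = TRUE)
```

With a de-trending step before resampling (and the trend added back afterwards), the same scheme would yield more realistic, wider bands around the non-stationary return levels.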
2.1.8 (h) Represent in the same graph your predictions of the 13-years return levels, their pointwise confidence intervals, the predicted yearly max values from the linear model and the observed values greater than 140 cm from the table below.
Code
rlci$Year <- c(2010:2022)
return_levels_df$Year <- c(2010:2022)

interactive_venice_plot %>%
  add_lines(data = rlci, x = ~Year, y = ~return_level, color = I("red"),
            name = "Predicted values (model)") %>%
  add_ribbons(data = rlci, x = ~Year, ymin = ~Lower_CI, ymax = ~Upper_CI,
              color = I("red"), name = "Confidence Interval (model)")
2.1.9 (i) Broadly speaking, each year, there is a chance of 1/13 that the observed value is above the 13- years return level. Comment the results for both the linear model prediction and GEV approach. Note that 12 of the 20 events occurred in the 21st century.
The main issue we see is that the observed values from 2010 to 2022 are much higher than the expected return level. This indicates that the models we are using severely underestimate how fast the sea level is currently rising, most notably for the extreme values.
The linear model predictions are not reliable, as the model does not allow its parameters to vary over time. As a result, only one of the observed values falls inside the confidence intervals.
The return levels from the GEV model give better results overall, but are still far from reality. If we imagine similarly wide confidence intervals for the GEV (ignoring the limitations we had) as for the linear model, up to 6 of the 8 values could fall inside its bounds, which would be reasonably good, except for the two highest values (12/11/2019 and 22/11/2022), which are on a scale the model could not anticipate.
In conclusion, the models used so far are all incomplete and could not predict the massive change the sea levels underwent, most notably in the last three years, when the values reached heights never seen before. This can be explained by the effects of global warming, which accelerates the pace at which sea levels rise worldwide.
2.2 Part 2: Nuclear reactors
2.2.1 (a) Read in the data. Display a time series plot of the water level across the data range and try to identify times of highest levels.
Code
# Load the data
load(here::here("data/practical_2/niveau.Rdata"))
# niveau$Wert <- format(niveau$Wert, nsmall = 2)

# Convert to Date format
niveau$Zeitstempel <- as.Date(niveau$Zeitstempel)
niveau$Zeitpunkt_des_Auftretens <- as.Date(niveau$Zeitpunkt_des_Auftretens)

# Identify points above the threshold
quantile(niveau$Wert, 0.95)
##      95%
## 326.9117
threshold <- 326.9117
above_threshold <- niveau[niveau$Wert > threshold, ]

# Create a base plot
p <- plot_ly(data = niveau, x = ~Zeitstempel, y = ~Wert, type = 'scatter',
             mode = 'lines', name = 'Water Level', hoverinfo = 'x+y')

# Add points above the threshold
p <- add_trace(p, data = above_threshold, x = ~Zeitstempel, y = ~Wert,
               mode = 'markers', marker = list(color = 'red', size = 10),
               name = 'Above Threshold')

# Customize layout
p <- layout(p, title = 'Time Series Plot of Water Levels',
            xaxis = list(title = 'Date'),
            yaxis = list(title = 'Water Level (m)'))

# Show the plot
p
To pinpoint the periods with the highest water levels, we’ve established a threshold of 326.91 meters (95% quantile). Upon analysis, the peak water level was observed in August 2007, surpassing this predefined threshold, with a value of 329.32 meters. Furthermore, in July 2021, there were many instances where the water levels exceeded the established threshold of 326.91 meters.
2.2.2 (b) Now display a histogram of the water levels. What do you observe about the distribution?
Code
# Plot histogram with custom axis
hist(niveau$Wert, breaks = 30, col = "skyblue", xlab = "Water Level (m)",
     main = "Histogram of Water Levels", xaxt = 'n')  # Turn off default x-axis

# Add a line for the mean
mean_value <- mean(niveau$Wert)
abline(v = mean_value, col = "red", lwd = 2)

# Add a custom x-axis with decimals
axis(1, at = seq(floor(min(niveau$Wert)), ceiling(max(niveau$Wert)), by = 0.5), las = 2)

# Add a legend
legend("topright", legend = paste("Mean =", round(mean_value, 2)), col = "red", lwd = 2)
The histogram reveals a right-skewed distribution, indicating a higher frequency of lower water level readings and a gradual decrease in frequency as the levels rise. Most of the observed water levels concentrate below 327 meters. Notably, the distribution’s tail extends to the right, suggesting infrequent occurrences of substantially elevated water levels. These exceptional values could represent sporadic episodes of extreme water conditions.
2.2.3 (c) Explain how you would model the high water levels using a peaks-over-threshold approach.
The peaks-over-threshold (POT) approach is a statistical method used to model extreme values in a dataset. It focuses on the tail of the distribution, where the extreme events reside. To model high water levels using the POT approach, I would follow these steps:
Select a threshold: choose a level above which water-level readings are considered extreme and above which the mean excess plot is roughly linear. The threshold should be high enough to exclude mundane variations, but not so high that too few data points remain to model accurately.
Extract exceedances: once the threshold is set, identify all data points that exceed it. These "peaks" are what you will analyze.
Fit a distribution: fit a Generalized Pareto Distribution (GPD) to the exceedances, e.g., by maximum likelihood estimation or Bayesian methods.
Estimate the parameters: the shape parameter (ξ) and the scale parameter (σ); the threshold u plays the role of the location parameter. These parameters define the behavior of the distribution's tail.
Run model diagnostics: after fitting the model, perform diagnostic checks to ensure that the GPD provides a good fit: analyze residual plots, conduct goodness-of-fit tests, and compare empirical and theoretical exceedance probabilities.
Check for clustering: apply the Ferro-Segers estimate via the extremalindex function in R, which counts exceedances of a high threshold within time windows. An extremal index well below 1 (i.e., a mean cluster size 1/θ greater than 1) suggests clustering of exceedances, indicating a need for declustering.
Estimate return levels: estimate high quantiles (return levels) and their corresponding return periods. The m-year return level is the value expected to be exceeded on average once every m years (e.g., a 100-year flood level).
Assess uncertainty: construct confidence intervals for the return levels, e.g., with confint.
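The return-level step above can be sketched in base R. This is an illustrative helper, not the practical's actual code: the parameter values passed in the example are hypothetical, and in practice they would come from the fitted GPD (e.g., via fevd or fpot).

```r
# m-year return level of a POT/GPD model (sketch):
#   x_m = u + (sigma / xi) * ((m * npy * zeta_u * theta)^xi - 1)
# u: threshold; sigma, xi: GPD scale and shape; npy: observations per year;
# zeta_u: probability of exceeding u; theta: extremal index (1 = no clustering)
gpd_return_level <- function(m, u, sigma, xi, npy, zeta_u, theta = 1) {
  if (abs(xi) < 1e-8) {
    u + sigma * log(m * npy * zeta_u * theta)  # limit as xi -> 0
  } else {
    u + (sigma / xi) * ((m * npy * zeta_u * theta)^xi - 1)
  }
}

# Hypothetical parameter values, for illustration only:
gpd_return_level(m = 100, u = 326.9, sigma = 0.5, xi = -0.1,
                 npy = 365.25, zeta_u = 0.05, theta = 0.15)
```

Note how the extremal index θ enters the formula: clustering reduces the effective number of independent exceedances per year, and hence the return level.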
2.2.4 (d) Comment on the aspect of clustering of extremes. How do you propose to measure and deal with clustering of the daily water levels?
Clustering of extremes refers to the tendency of extreme events to occur in sequences or bursts rather than being uniformly distributed over time. This is particularly relevant in environmental data, such as water levels, where extreme events can depend on preceding conditions (e.g., three consecutive days of heavy rainfall in La Chaux-de-Fonds).
To measure and deal with clustering in daily water levels, we propose the following steps:
Measuring clustering:
Run tests for serial dependence: use statistical tools such as the autocorrelation function (ACF) or the Ljung-Box test to detect serial dependence in the data above the threshold.
Identify clusters: use the extremalindex function to estimate the extremal index θ; a value well below 1 indicates clustering. The declustering run length (how much time must pass without an exceedance before a new cluster is defined) also needs to be chosen.
Apply declustering methods: separate the data into independent clusters of exceedances and intervening non-exceedances. A straightforward approach is to use a for loop to identify and group extreme events by their year of occurrence, and then use the decluster function from the extRemes package.
Dealing with clustering:
Declustering the Data: before fitting a model to the exceedances, decluster the data to remove the influence of serial dependence. Each cluster may be represented by its maximum value.
Modeling with clustering: incorporate intensity measures, such as the mean number of exceedances per cluster or the average duration of clusters, into the risk assessment.
Time-Varying Thresholds: consider dynamic thresholds that adjust for seasonality or other known factors that influence clustering, rather than a fixed threshold.
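The runs-declustering idea described above can be illustrated in base R. This is a minimal sketch of what extRemes::decluster does (the helper name runs_decluster is ours): exceedances separated by more than `run` non-exceedances start a new cluster, and each cluster is summarised by its maximum.

```r
# Minimal runs declustering (illustration only):
runs_decluster <- function(x, threshold, run = 1) {
  exc <- which(x > threshold)               # indices of exceedances
  if (length(exc) == 0) return(numeric(0))
  new_cluster <- c(TRUE, diff(exc) > run)   # gap > run starts a new cluster
  cluster_id <- cumsum(new_cluster)
  as.numeric(tapply(x[exc], cluster_id, max))  # one maximum per cluster
}

x <- c(1, 5, 6, 2, 1, 7, 1, 1, 8, 9, 1)
runs_decluster(x, threshold = 4, run = 1)  # cluster maxima: 6 7 9
```

In the example, the five exceedances of 4 fall into three clusters, and only the three cluster maxima would be passed on to the GPD fit.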
2.2.5 (e) Perform the analysis you suggest in c) and d) and compute the 50- and 100-year return levels. Explain your choice of threshold and provide an estimate of uncertainty for the return levels. Note: take care to compute the return level in yearly terms.
Code
# 1. Select threshold: using mean residual life plot, quantile and threshrange.plot
plot(niveau$Wert)
Code
meplot(niveau$Wert, at =pretty(niveau$Wert, n =10), labels =format(pretty(niveau$Wert, n =10), nsmall =1))
Code
mrlplot(niveau$Wert,main="Mean Residual Life Plot")
Code
quantile(niveau$Wert, 0.95)  # 326.9117
##      95% 
## 326.9117
threshrange.plot(niveau$Wert, r = c(325, 329), nint = 16)  # Choose 327
## Error in get(paste("ci.", newcl, sep = ""))(x = x, alpha = alpha, type = type, :
##   ci: Sorry, unable to calculate the parameter covariance matrix. Maybe try a different method.
## Error in if (ipars1["scale"] <= 0) ipars1["scale"] <- 0.00000001 :
##   missing value where TRUE/FALSE needed
Code
threshold_clustering <- 326.9117  # select 327 as the threshold

# 2. Extract exceedances
exceedances <- niveau$Wert[niveau$Wert > threshold_clustering]

# 3. Fit a GPD to the exceedances and estimate its parameters
fitMLE <- fevd(as.vector(exceedances), method = "MLE", type = "GP",
               threshold = threshold_clustering)

# 4. Model diagnostics
plot(fitMLE)
Code
# 5. Check if there is clustering of extremes (Ferro-Segers estimate)
# 1 / 0.149 = 6.7: a large mean cluster size (plausible for river water-level
# data), indicating a need to decluster
extremalindex(niveau$Wert, threshold_clustering)
## 
## Intervals Method Estimator for the Extremal Index
## NULL
## 
## theta.tilde used because there exist inter-exceedance times > 2.
##  extremal.index number.of.clusters         run.length 
##     0.149419682       60.000000000       13.000000000

# 6. Declustering
years <- c()
k <- 1
for (i in 1:nrow(niveau)) {
  if (is.na(niveau$Wert[i])) {
    next
  } else {
    years[k] <- year(niveau$Zeitstempel[i])
    k <- k + 1
  }
}
years <- years - 1999
decl1 <- decluster(as.vector(niveau$Wert), threshold = threshold_clustering,
                   groups = years, na.action = na.omit)
plot(decl1)
Code
# Add years and decl1 to the niveau data frame
niveau$years <- years
niveau$decl1 <- decl1

# 7. Estimate return levels with the declustered data, using cmax = TRUE
# Assuming 1 observation per day, we prefer fpot from evd
rl_mle_evd_50 <- fpot(as.vector(niveau$Wert), cmax = TRUE,
                      threshold = threshold_clustering, npp = 365.25, mper = 50)
rl_mle_evd_100 <- fpot(as.vector(niveau$Wert), cmax = TRUE,
                       threshold = threshold_clustering, npp = 365.25, mper = 100)
rl_mle <- return.level(fitMLE, conf = 0.05,
                       return.period = c(2, 5, 10, 20, 50, 100), do.ci = TRUE)

# Plot the 50-year and 100-year return levels
par(mfrow = c(2, 2))
plot(rl_mle_evd_50)
title(main = "50-Year Return Level", line = -1, col.main = "red")
Code
plot(rl_mle_evd_100)
title(main = "100-Year Return Level", line = -1, col.main = "red")
Code
par(mfrow =c(1, 1))
This threshold was determined by examining the mean residual life plot, where 326.9 meters emerged as the point at which the mean excess values begin to vary noticeably, marking a transition in the data. The choice is further supported by the fact that 326.9 meters corresponds to the 95th percentile of the dataset, a level exceeded by only the top 5% of observations.
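As a complement, the mean residual life diagnostic used here can be computed by hand in base R. This sketch uses simulated exponential data (for which the mean excess is constant), not the niveau series.

```r
# Mean excess e(u) = E[X - u | X > u], evaluated on a grid of thresholds.
# For a GPD tail, e(u) is linear in u; for a unit exponential it equals 1.
mean_excess <- function(x, thresholds) {
  sapply(thresholds, function(u) mean(x[x > u] - u))
}

set.seed(42)
x <- rexp(5000)               # unit exponential: e(u) = 1 for all u
u <- seq(0, 3, by = 0.25)
plot(u, mean_excess(x, u), type = "b",
     xlab = "Threshold u", ylab = "Mean excess")
```

A roughly linear (here, flat) mean excess above some u supports treating exceedances of u as approximately GPD-distributed.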
2.2.6 (f) Explain the drawbacks and advantages of using a block maxima method instead of the one used in c)-e).
The block maxima approach divides the series into blocks (e.g., years) and fits a GEV distribution to the block maxima. Its main advantages are simplicity and robustness: it avoids the delicate choice of a threshold, and taking one maximum per block largely removes the short-range clustering that the POT approach must handle by declustering. Its main drawback is that it discards data: only one observation per block is used, even when a block contains several large exceedances (or none at all), so parameter and return-level estimates are typically less efficient, with wider confidence intervals, than those of the POT/GPD approach on the same series.
2.3 Part 3: Night temperatures in Lausanne
2.3.1 (a) Read in the data for the daily night maximum temperatures in Lausanne. Subset the summer months (June to September).
Code
night_max <- read_csv(here::here("data/practical_2/nightmax.csv"))
## New names:
## Rows: 8263 Columns: 4
## -- Column specification ---------------------------------------------
## Delimiter: ","
## chr  (1): station
## dbl  (2): ...1, night.max
## date (1): date
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## * `` -> `...1`
night_min <- read_csv(here::here("data/practical_2/nightmin.csv"))
## New names:
## Rows: 8271 Columns: 4
## -- Column specification ---------------------------------------------
## Delimiter: ","
## chr  (1): station
## dbl  (2): ...1, night.min
## date (1): date
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## * `` -> `...1`

# Remove rows with missing values
night_max <- na.omit(night_max)
night_min <- na.omit(night_min)

# Subset the data for summer months (June to September)
summer_night_max <- night_max[format(night_max$date, "%m") %in% c("06", "07", "08", "09"), ]
2.3.2 (b) Assess whether extremes of the subsetted series in (a) occur in cluster.
Code
# Visualise the data: plot and histogram of the maximum temperatures for the summer months
summer_night_max$date_str <- format(summer_night_max$date, "%d %b %Y")
interactive_plot <- plot_ly(summer_night_max, x = ~date, y = ~night.max,
                            type = 'scatter', mode = 'lines+markers',
                            hoverinfo = 'text',
                            text = ~paste('Date:', date_str, '<br>Temp:', night.max)) %>%
  layout(title = "Summer Night Max Temperatures",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Night Max Temperature"))
interactive_plot
Code
hist(summer_night_max$night.max, breaks = 30, col = "skyblue", xlab = "Night Max",
     ylab = "Frequency", main = "Histogram of summer night max temperatures")
Code
# Get more information about the distribution
min(summer_night_max$night.max); mean(summer_night_max$night.max); max(summer_night_max$night.max)
## [1] 5.6
## [1] 21.9083848
## [1] 34.6
quantile(summer_night_max$night.max, 0.95)  # doesn't seem to be very right-skewed
## 95% 
##  30

# Choose the threshold using mrlplot()
mrlplot(summer_night_max$night.max, main = "Mean residual")
Code
threshrange.plot(summer_night_max$night.max, r = c(20, 30), nint = 20)  # we are going to use 30
Code
th_1 <- 30  # with this threshold we would have 5% of observations (about 400), which seems good

# Visualise the threshold
plot(summer_night_max$night.max, type = 'l')
abline(h = th_1, col = 2)  # looks good
Code
# Assess the threshold
pot_mle <- fevd(summer_night_max$night.max, method = "MLE", type = "GP", threshold = th_1)
plot(pot_mle)
Code
rl_mle <- return.level(pot_mle, conf = 0.05, return.period = c(2, 5, 10, 20, 50, 100), do.ci = TRUE)
# The diagnostic plot looks good and the confidence intervals become wider
# the longer the return period, which makes sense

# Return level plot with MLE
par(mfcol = c(1, 1))
plot(pot_mle, type = "rl", main = "Return Level Plot for Oberwang w/ MLE",
     ylim = c(0, 200), pch = 16)
loc <- as.numeric(return.level(pot_mle, conf = 0.05, return.period = 50))
segments(50, 0, 50, loc, col = 'midnightblue', lty = 6)
segments(0.01, loc, 50, loc, col = 'midnightblue', lty = 6)
Code
# Assess whether extremes occur in a cluster using extremalindex()
extremalindex(night_max$night.max, th_1)  # 0.31, so 1/0.31 = 3.23: we need to decluster
## 
## Intervals Method Estimator for the Extremal Index
## NULL
## 
## theta.tilde used because there exist inter-exceedance times > 2.
##  extremal.index number.of.clusters         run.length 
##     0.309860203       42.000000000        7.000000000
We have an approximate mean cluster size of 3.23 (1/0.31), indicating that extreme values occur in clusters. Therefore, we need to decluster.
2.3.3 (c) Decluster the data from (a) using a suitable threshold. Plot the resulting declustered data. (Hint: you may want to use the extRemes package.)
Code
# We need to create a vector of the size of summer_night_max that stores the year.
years_part3 <- c()
k <- 1
for (i in 1:nrow(summer_night_max)) {
  if (is.na(summer_night_max$night.max[i])) {
    next
  } else {
    years_part3[k] <- year(summer_night_max$date[i])
    k <- k + 1
  }
}
years_part3 <- years_part3 - 1999

# Use the decluster function of extRemes
decl1 <- decluster(summer_night_max$night.max, threshold = th_1,
                   groups = years_part3, na.action = na.omit)  # vector and groups need to have the same size
decl1  # we have 71 clusters
## 
## summer_night_max$night.max declustered via runs declustering.
## 
## Estimated extremal index (intervals estimate) = 0.741502152
## 
## Number of clusters = 71
## 
## Run length = 1
plot(decl1)  # shows in grey the points that are not retained in the selection
2.3.4 (d) Fit a GPD to the data, both raw and declustered. Assess the quality of the fit.
Code
# Use fevd on the raw and the declustered data
gpd_raw <- fevd(summer_night_max$night.max, threshold = th_1, type = "GP")
gpd_declustered <- fevd(decl1, threshold = th_1, type = "GP")
par(mfrow = c(2, 2))
plot(gpd_raw)
title(main = "GPD fitted to raw data", line = 1, col.main = "red")
Code
plot(gpd_declustered)
title(main = "GPD fitted to declustered data", line = 1, col.main = "red")
Code
par(mfrow = c(1, 1))
# Assess the fit: AIC for the raw data is 369, AIC for the declustered data is 193
For the raw data, the adherence of the points to the 1:1 line in the quantile-quantile (Q-Q) plot suggests a good fit. In terms of AIC, however, the declustered fit has the lower value (193 against 369 for the raw data); note that AIC values for fits to different datasets (raw vs. declustered exceedances) are not strictly comparable, so the Q-Q diagnostics remain the more reliable guide here.
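For reference, the AIC values quoted in the code comments follow the usual definition AIC = 2k − 2ℓ, where k is the number of estimated parameters (two for the GPD: scale and shape) and ℓ the maximised log-likelihood. A tiny sketch, where the log-likelihood value is hypothetical, chosen only to reproduce the reported AIC of 193:

```r
# AIC from a maximised log-likelihood: AIC = 2 * k - 2 * loglik
aic <- function(loglik, k) 2 * k - 2 * loglik

# Hypothetical log-likelihood of -94.5 with k = 2 parameters gives AIC = 193
aic(loglik = -94.5, k = 2)
```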
2.3.5 (e) Repeat the above analysis for the negatives of the daily nightly minimum temperatures for the winter months (November-February).
As we are dealing with negative values (minima), we need to invert the night.min values to apply the peaks-over-threshold method.
Note: to correctly subset winter months in night.min, given that winter spans November and December of year t and January and February of year t+1, we need to adjust the selection process.
Code
# Assuming 'night_min' has a 'date' column in Date format
night_min$year <- as.numeric(format(night_min$date, "%Y"))
night_min$month <- format(night_min$date, "%m")
night_min <- night_min %>%
  mutate(year = as.numeric(format(date, "%Y")),
         month = format(date, "%m"),
         season_year = ifelse(month %in% c("01", "02"), year - 1, year))

# Subset for winter months
winter_night_min <- night_min %>%
  filter((month %in% c("11", "12") & season_year == year) |
         (month %in% c("01", "02") & season_year == year - 1))

# Add the month in season-year
winter_night_min <- winter_night_min %>%
  mutate(day = day(date), full_date = paste(day, month, season_year, sep = "-"))

# Convert into Date format
winter_night_min <- winter_night_min %>%
  mutate(full_date = as.Date(full_date, format = "%d-%m-%Y"))
Note: to properly implement the peaks-over-threshold method with negative values, the series is negated (multiplied by −1) so that the extreme minima become maxima.
Code
# Invert night.min as the temperatures of interest are minima
winter_night_min$night.min <- -winter_night_min$night.min

# Visualise the data: plot and histogram
p <- ggplot(winter_night_min, aes(x = full_date, y = night.min)) +
  geom_line() +
  labs(x = "Date", y = "Night Min Temperature", title = "Winter Night Min Temperatures") +
  theme_minimal()
interactive_plot_winter <- ggplotly(p)
interactive_plot_winter
Code
hist(winter_night_min$night.min, breaks = 30, col = "skyblue",
     xlab = "Night Min Temperature", ylab = "Frequency",
     main = "Winter Night Min Temperatures frequency")
Code
# Get more information about the distribution
min(winter_night_min$night.min); mean(winter_night_min$night.min); max(winter_night_min$night.min)
## [1] -13.6
## [1] -2.8185762
## [1] 13.4
quantile(winter_night_min$night.min, 0.95)
## 95% 
## 3.3

# Choose the threshold using mrlplot()
mrlplot(winter_night_min$night.min, main = "Mean residual")
th_2 <- 3.3

# Visualise the threshold
plot(winter_night_min$night.min, type = 'l')
abline(h = th_2, col = 2)  # looks good
Code
# Assess the threshold
pot_mle_2 <- fevd(winter_night_min$night.min, method = "MLE", type = "GP", threshold = th_2)
plot(pot_mle_2)
Code
rl_mle <- return.level(pot_mle_2, conf = 0.05, return.period = c(2, 5, 10, 20, 50, 100), do.ci = TRUE)
# The diagnostic plot looks good and the confidence intervals become wider
# the longer the return period, which makes sense

# Return level plot with MLE
par(mfcol = c(1, 1))
plot(pot_mle_2, type = "rl", main = "Return Level Plot for Oberwang w/ MLE",
     ylim = c(0, 200), pch = 16)
loc <- as.numeric(return.level(pot_mle_2, conf = 0.05, return.period = 50))
segments(50, 0, 50, loc, col = 'midnightblue', lty = 6)
segments(0.01, loc, 50, loc, col = 'midnightblue', lty = 6)
Code
# Assess whether extremes occur in a cluster using extremalindex()
extremalindex(winter_night_min$night.min, th_2)  # 0.32, so 1/0.32 = 3.125
## 
## Intervals Method Estimator for the Extremal Index
## NULL
## 
## theta.tilde used because there exist inter-exceedance times > 2.
##  extremal.index number.of.clusters         run.length 
##      0.32095905        40.00000000         5.00000000

########################### Declustering ###########################
# We need to create a vector of the size of winter_night_min that stores the year.
years_part3_2 <- c()
k <- 1
for (i in 1:nrow(winter_night_min)) {
  if (is.na(winter_night_min$night.min[i])) {
    next
  } else {
    years_part3_2[k] <- year(winter_night_min$date[i])
    k <- k + 1
  }
}
years_part3_2 <- years_part3_2 - 1999

# Use the decluster function of extRemes
decl2 <- decluster(winter_night_min$night.min, threshold = th_2,
                   groups = years_part3_2, na.action = na.omit)  # vector and groups need to have the same size
decl2  # we have 53 clusters
## 
## winter_night_min$night.min declustered via runs declustering.
## 
## Estimated extremal index (intervals estimate) = 0.832497359
## 
## Number of clusters = 53
## 
## Run length = 1
plot(decl2)  # shows in grey the points that are not retained in the selection
Code
# Fit a GPD to the raw and the declustered data
gpd_raw_2 <- fevd(winter_night_min$night.min, threshold = th_2, type = "GP")
gpd_declustered_2 <- fevd(decl2, threshold = th_2, type = "GP")
par(mfrow = c(2, 2))
plot(gpd_raw_2)
title(main = "GPD fitted to raw data", line = 1, col.main = "red")
Code
plot(gpd_declustered_2)
title(main = "GPD fitted to declustered data", line = 1, col.main = "red")
Code
par(mfrow = c(1, 1))
# Assess the fit: AIC for the raw data is 38100, AIC for the declustered data is 193
3 Practical 3
Code
source(here::here("script/setup.R"))
3.1 Part 1
3.1.1 (a) Read in the data and select only the winter temperature data. You can use the code lines of the file GetData.R to get the data and select only the winter values.
Code
##### Download daily NAO measurements
NAO.daily <- fread('ftp://ftp.cdc.noaa.gov/Public/gbates/teleconn/nao.reanalysis.t10trunc.1948-present.txt')
NAO.daily <- as.matrix(NAO.daily)
colnames(NAO.daily) <- c("year", "month", "day", "NAO")

##### Download temperature data: be sure to work in the correct directory
months <- c(12, 1, 2)  # keep only winter observations
temp_max_Zermatt <- read_delim(here::here("data/practical_3/daily_maximum_Zermatt/order_107669_data.txt"),
                               ";", escape_double = FALSE,
                               col_types = cols(time = col_number()),
                               trim_ws = TRUE, skip = 1)
colnames(temp_max_Zermatt) <- c("station", "time", "temp")
temp_max_Zermatt <- temp_max_Zermatt[-1, ]
temp_max_Zermatt[, 2] <- as.Date(apply(temp_max_Zermatt[, 2], 1, as.character), "%Y%m%d")
temp_max_Montana <- read_delim(here::here("data/practical_3/daily_maximum_Montana/order_107668_data.txt"),
                               ";", escape_double = FALSE,
                               col_types = cols(time = col_number()),
                               trim_ws = TRUE, skip = 1)
colnames(temp_max_Montana) <- c("station", "time", "temp")
temp_max_Montana <- temp_max_Montana[-1, ]
temp_max_Montana[, 2] <- as.Date(apply(temp_max_Montana[, 2], 1, as.character), "%Y%m%d")

### match the dates of the two time series
temp_max_Montana <- temp_max_Montana[match(as.matrix(temp_max_Zermatt[, 2]), as.matrix(temp_max_Montana[, 2])), ]
temp_max_Montana <- as.matrix(temp_max_Montana)
colnames(temp_max_Montana) <- c("station", "time", "temp")
temp_max_Zermatt <- as.matrix(temp_max_Zermatt)
colnames(temp_max_Zermatt) <- c("station", "time", "temp")

### keep only winter dates
temp_max_Montana <- temp_max_Montana[which(month(as.POSIXlt(temp_max_Montana[, "time"], format = "%Y-%m-%d")) %in% months), ]
temp_max_Zermatt <- temp_max_Zermatt[which(month(as.POSIXlt(temp_max_Zermatt[, "time"], format = "%Y-%m-%d")) %in% months), ]

Date <- function(length = 0) {
  newDate = numeric(length)
  class(newDate) = "Date"
  return(newDate)
}
season_day <- yday(as.Date(temp_max_Montana[, 2]))
season_day[season_day < 61] <- season_day[season_day < 61] + 31
season_day[season_day > 334] <- season_day[season_day > 334] - 334
NAO.date <- Date(nrow(NAO.daily))
for (i in 1:nrow(NAO.daily)) {
  NAO.date[i] <- as.Date(paste(as.character(NAO.daily[i, 1]), "-",
                               as.character(NAO.daily[i, 2]), "-",
                               as.character(NAO.daily[i, 3]), sep = ""),
                         format = "%Y-%m-%d")
}
NAO.daily <- mutate(as.tibble(NAO.daily), date = make_date(year, month, day))
index <- as.Date(intersect(as.Date(temp_max_Montana[, 2]), as.Date(NAO.date)))
nao <- NAO.daily[which(NAO.daily$date %in% index), 4]
nao <- as.vector(nao)

# Montana
x_Montana <- data.frame("time" = temp_max_Montana[, 2], "nao" = nao,
                        "d" = season_day, "temp" = temp_max_Montana[, 3])
as.numeric.factor <- function(x) { as.numeric(levels(x))[x] }
x_Montana[, "temp"] <- as.numeric(x_Montana[, "temp"])
x_Montana[, "time"] <- as.numeric(year(as.POSIXlt(x_Montana[, "time"], format = "%Y-%m-%d")))
x_Montana[, "time"] <- (x_Montana[, "time"] - min(x_Montana[, "time"])) / (max(x_Montana[, "time"]) - min(x_Montana[, "time"]))

# Zermatt
x_Zermatt <- data.frame("time" = temp_max_Zermatt[, 2], "nao" = nao,
                        "d" = season_day, "temp" = temp_max_Zermatt[, 3])
x_Zermatt[, "temp"] <- as.numeric(x_Zermatt[, "temp"])
x_Zermatt[, "time"] <- as.numeric(year(as.POSIXlt(x_Zermatt[, "time"], format = "%Y-%m-%d")))
x_Zermatt[, "time"] <- (x_Zermatt[, "time"] - min(x_Zermatt[, "time"])) / (max(x_Zermatt[, "time"]) - min(x_Zermatt[, "time"]))

Z <- data.frame("Montana" = x_Montana[, "temp"], "Zermatt" = x_Zermatt[, "temp"], "NAO" = x_Zermatt[, "NAO"])
3.1.2 (b) Represent the data and especially scatterplots of Zermatt values against Montana’s values. Same with Montana values against NAO values. Comment.
# Scatterplot of Zermatt values against Montana values
plot(Z$Montana, Z$Zermatt, xlab = "Montana Temperature", ylab = "Zermatt Temperature",
     main = "Zermatt vs Montana Temperatures")
abline(lm(Z$Zermatt ~ Z$Montana), col = "red")
legend("topright", legend = "Fitted Regression Line", col = "red", lty = 1, cex = 0.8)
Code
# Scatterplot of Montana values against NAO values
plot(Z$Montana, Z$NAO, xlab = "Montana Temperature", ylab = "NAO Index",
     main = "Montana Temperature vs NAO Index")
abline(lm(Z$NAO ~ Z$Montana), col = "red")  # Add a linear regression line
legend("topright", legend = "Fitted Regression", col = "red", lty = 1, cex = 0.8)
Zermatt vs Montana temperatures: the scatterplot displays a discernible pattern, indicating a positive correlation between the temperatures in Zermatt and Montana. The red line, the fitted linear regression, closely follows the trajectory of the data points, suggesting a strong positive linear association between the temperatures recorded at the two stations.
Montana temperature vs NAO index: the scatterplot presents a dense concentration of data points with only a slight upward trend. The fitted regression line is nearly horizontal, suggesting a weak linear relationship between the temperature in Montana and the NAO index. The slightly positive slope implies that the NAO index is at best a weak predictor of temperature in Montana.
3.1.3 (c) Test the correlation between these two pairs of series.
cor_Zermatt_Montana <- cor(Z$Zermatt, Z$Montana)
cat("Correlation between Zermatt and Montana Temperatures:", cor_Zermatt_Montana, "\n")
## Correlation between Zermatt and Montana Temperatures: 0.95456767
cor.test(Z$Zermatt, Z$Montana)
## 
##  Pearson's product-moment correlation
## 
## data:  Z$Zermatt and Z$Montana
## t = 194.7707, df = 3697, p-value < 0.000000000000000222
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.951614981 0.957344110
## sample estimates:
##        cor 
## 0.95456767
Code
# Correlation between Montana Temperatures and NAO Index
cor_Montana_NAO <- cor(Z$Montana, Z$NAO)
cat("Correlation between Montana Temperatures and NAO Index:", cor_Montana_NAO, "\n")
## Correlation between Montana Temperatures and NAO Index: 0.254067152

# Correlation test
cor.test(Z$Montana, Z$NAO)  # significant
## 
##  Pearson's product-moment correlation
## 
## data:  Z$Montana and Z$NAO
## t = 15.97214, df = 3697, p-value < 0.000000000000000222
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.223670701 0.283969870
## sample estimates:
##         cor 
## 0.254067152
Correlation between Montana and Zermatt temperatures: the coefficient of 0.95 indicates an exceptionally strong positive relationship, consistent with the expectation of a close climatic linkage between the two regions. cor.test confirms that this correlation is statistically significant, supporting the inference that the temperatures at the two stations move in a closely synchronized manner.
Correlation between Montana and NAO: observed to be 0.25, a relatively weak positive correlation: there is some association between the two variables, but it is not substantial. cor.test nevertheless indicates that this correlation is significant.
3.1.4 (d) Now concentrate on the extreme level and tail dependence. Represent χ and χ̄ for two pairs of series (Montana vs Zermatt and Montana vs NAO). You may want to use the function chiplot of the package evd. What do you observe?
# Create a data frame for Zermatt vs Montana Temperatures
data_Zermatt_Montana <- data.frame(Zermatt = Z$Zermatt, Montana = Z$Montana)

# Chi Plot - Zermatt vs Montana Temperatures
chiplot(data_Zermatt_Montana, main1 = "Chi Plot - Zermatt vs Montana Temperatures")
Code
# Create a data frame for Montana Temperatures vs NAO Index
data_Montana_NAO <- data.frame(Montana = Z$Montana, NAO = Z$NAO)

# Chi Plot - Montana Temperatures vs NAO Index
chiplot(data_Montana_NAO, main1 = "Chi Plot - Montana Temperatures vs NAO Index")
Chi and chi-bar plots of Zermatt vs Montana temperatures: these indicate the existence of tail dependence, revealing a significant relationship between the extreme values. The estimates deviate consistently from zero across all quantiles, with the dependence measure increasing towards 1. This pattern suggests that the extreme temperatures in Zermatt and Montana are not independent but exhibit linked behavior, particularly in the tails of their respective distributions.
Chi and chi-bar plots of Montana vs NAO: these suggest an absence of tail dependence, with the estimates hovering around zero. The two variables are likely to act independently at their extremes: extreme values of Montana's temperatures and of the NAO index do not tend to occur in conjunction.
3.2 Part 2
3.2.1 (a) Read in the data. Then, plot the raw prices (Adj.Close) of both stocks. Discuss the stationarity assumption.
Code
engie <- read.csv(here::here("data/practical_3/engie.csv"))
veolia <- read.csv(here::here("data/practical_3/veolia.csv"))

# Data cleaning and plot
engie <- engie[-c(4456, 4489), ]
veolia <- veolia[-c(4456, 4489), ]
engie$Adj.Close <- as.numeric(engie$Adj.Close)
veolia$Adj.Close <- as.numeric(veolia$Adj.Close)
par(mfrow = c(1, 2))
plot(engie$Adj.Close)
plot(veolia$Adj.Close)
Code
# ADF test to assess the stationarity
adf_engie <- adf.test(engie$Adj.Close)    # p-value = 0.4466
adf_veolia <- adf.test(veolia$Adj.Close)  # p-value = 0.8228
Since the p-value is higher than 0.05 for both series, we cannot reject the null hypothesis of non-stationarity of the time series.
3.2.2 (b) First, compute and plot the negative log returns in independent plots. Discuss the dates of occurrence of extreme events. Then, produce a scatterplot of the negative log returns of both stocks and discuss their (visual) dependence.
Code
# Create a log return function
calculate_log_returns <- function(stocks) {
  log_returns <- -diff(log(stocks))
  return(log_returns)
}

# Calculate negative log returns for each stock
log_returns_engie <- calculate_log_returns(engie$Adj.Close)
log_returns_veolia <- calculate_log_returns(veolia$Adj.Close)
plot(log_returns_engie)
plot_engie <- plot_ly(x = engie$Date, y = engie$Adj.Close, type = "scatter",
                      mode = "markers", name = "Engie") %>%
  layout(title = "Engie")
plot_veolia <- plot_ly(x = veolia$Date, y = veolia$Adj.Close, type = "scatter",
                       mode = "markers", name = "Veolia") %>%
  layout(title = "Veolia")
combined_plot_2 <- subplot(plot_engie, plot_veolia)
layout(combined_plot_2, title = "Comparison of stock indices")
At first sight, the log returns of the two stocks follow a similar pattern: the extremes occurred at more or less the same periods and the volatility is very high. The extreme events happened in 2008 (subprime crisis), 2012-2013 (European energy crisis), 2020 (COVID-19), and 2022 (energy crisis).
Code
# Draw a scatter plot
par(pty = "s")
scatter.smooth(x = log_returns_engie, y = log_returns_veolia,
               xlab = "Engie negative log returns",
               ylab = "Veolia negative log returns")
Our scatter plot shows a strong dependence between the negative log returns. No pronounced heavy tails are displayed, and the shape is close to elliptical.
3.2.3 (c) Plot the pseudo-samples obtained from estimating the marginal distributions by their empirical CDFs.
Code
# Compute and show the empirical CDF of both negative log return series
cdf_engie <- ecdf(log_returns_engie)
cdf_veolia <- ecdf(log_returns_veolia)
plot(cdf_engie, col = "red", main = "Empirical Cumulative Distributions")
lines(cdf_veolia, col = "blue")
legend("bottomright", legend = c("Engie", "Veolia"), col = c("red", "blue"), lty = 1)
Code
# Pseudo-observations for cdf_engie and cdf_veolia
u1 <- pobs(log_returns_engie)
u2 <- pobs(log_returns_veolia)
data <- data.frame(engie_prob = u1, veolia_prob = u2)
data_matrix <- as.matrix(data)
pairs.panels(data, method = "kendall")
We can see that the empirical cumulative distribution functions of Veolia and Engie are very similar. The pairs panel with Kendall's method shows a correlation coefficient of 0.42, indicating a fairly strong positive rank correlation.
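What pobs computes can be reproduced with base R ranks, and Kendall's tau, being rank-based, is unchanged by that transformation. A self-contained sketch on simulated data (not the stock returns):

```r
# Pseudo-observations by hand: ranks rescaled to (0, 1), as pobs() does.
set.seed(1)
x <- rnorm(200)
y <- 0.6 * x + rnorm(200)
u <- rank(x) / (length(x) + 1)
v <- rank(y) / (length(y) + 1)

# Kendall's tau depends only on the ranks, so it is identical before
# and after the probability-integral transform:
cor(x, y, method = "kendall")
cor(u, v, method = "kendall")
```

This invariance is why rank correlations such as Kendall's tau are natural companions to copula modelling, which likewise separates the dependence structure from the margins.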
3.2.4 (d) Fit a Gaussian, a t, a Gumbel and a Clayton copula to the data. Select the copula model using an information criterion, such as the AIC.
Code
# Fit a Gaussian copula
gaussian_copula <- normalCopula(dim = 2)
fit_gaussian <- fitCopula(gaussian_copula, data, method = "mpl")

# Fit a t copula
t_copula <- tCopula(dim = 2, df = 3)
fit_t <- fitCopula(t_copula, data, method = "mpl")

# Fit a Gumbel copula
gumbel_copula <- gumbelCopula(dim = 2)
fit_gumbel <- fitCopula(gumbel_copula, data, method = "mpl")

# Fit a Clayton copula
clayton_copula <- claytonCopula(dim = 2, param = 2)
fit_clayton <- fitCopula(clayton_copula, data, method = "mpl")

# Calculate AIC values
aic_values <- AIC(fit_gaussian, fit_t, fit_gumbel, fit_clayton)
aic_values
##              df         AIC
## fit_gaussian  1 -1939.37785
## fit_t         2 -2292.85549
## fit_gumbel    1 -2126.49971
## fit_clayton   1 -1246.12729
min(aic_values$AIC)  # the t copula fits our data best
## [1] -2292.85549
BiCopSelect(data[, 1], data[, 2], familyset = NA)
## Bivariate copula: t (par = 0.61, par2 = 3.92, tau = 0.42)
From the AIC results, we can conclude that the t copula is the best one for our data.
3.2.5 (e) Compute the theoretical tail dependence coefficients for the copulae fitted in (d), using the parameters obtained above.
Code
# Compute the tail dependence parameters through BiCopPar2TailDep and lambda functions
BiCopPar2TailDep(family = 2, par = fit_t@copula@parameters[1],
                 par2 = fit_t@copula@parameters[2])
## $lower
## [1] 0.326725819
## 
## $upper
## [1] 0.326725819
We obtain a lower and upper tail dependence of approximately 0.327, equal by construction since the t copula is symmetric. This indicates positive dependence between the extreme values of Engie and Veolia.
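The coefficient reported by BiCopPar2TailDep follows the closed-form expression for the t copula, λ = 2 t_{ν+1}(−√((ν+1)(1−ρ)/(1+ρ))), which can be checked in base R. The parameter values below are the ones reported by BiCopSelect above.

```r
# Tail dependence of a bivariate t copula (lower = upper, by symmetry):
t_copula_tail_dep <- function(rho, nu) {
  2 * pt(-sqrt((nu + 1) * (1 - rho) / (1 + rho)), df = nu + 1)
}

t_copula_tail_dep(rho = 0.61, nu = 3.92)  # roughly the 0.327 reported above
```

Note that λ > 0 for any ρ > −1 whenever ν is finite; it is the heavy tails of the t copula, not the correlation alone, that produce tail dependence.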
3.2.6 (f) Compute the non-parametric estimators of the tail dependence coefficients obtained from the data. Are these estimates close to the theoretical ones obtained with the best model selected in (d)?
The results of fitLambda are very close to the theoretical coefficients of the t copula. To corroborate this, we drew a chi plot, which graphically confirms the finding. From this information and the preceding analysis, we can conclude that there is positive dependence between the extreme events of the Engie and Veolia stocks.
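The non-parametric estimate referred to here amounts to an empirical version of χ(q) = P(V > q | U > q), evaluated at a high quantile of the pseudo-observations. A minimal sketch (the function name empirical_chi is ours; fitLambda in the copula package implements more refined estimators):

```r
# Empirical upper tail dependence at quantile q, on pseudo-observations:
empirical_chi <- function(u, v, q = 0.95) {
  mean(u > q & v > q) / mean(u > q)
}

# Sanity checks on synthetic pseudo-observations:
n <- 2000
w <- (1:n) / (n + 1)
empirical_chi(w, w)        # perfectly dependent margins -> 1
empirical_chi(w, rev(w))   # counter-monotonic margins -> 0
```

In practice one evaluates this at several high quantiles (e.g., q = 0.95, 0.975, 0.99) and checks that the estimates stabilise before comparing them with the model-based coefficient.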